Welcome to Benchpress

alternate text

Contents:

Documentation Status https://badge.fury.io/py/benchpress.svg

Quick Start

Fire up your terminal, and:

# Install using PyPi
pip install benchpress --user

# Make the Benchpress binaries available
export PATH=$PATH:$HOME/.local/bin

Specify what to benchmark by implementing a Python script that generates commands:

import benchpress as bp
from benchpress.suite_util import BP_ROOT

scripts = [
    ('X-ray',  'xraysim',  ["10*10*1", "20*10*1"]),
    ('Bean',   'galton_bean_machine',  ["10000*10", "20000*10"]),
]

cmd_list = []
for label, name, sizes in scripts:
    for size in sizes:
        full_label = "%s/%s" % (label, size)
        bash_cmd = "python {root}/benchmarks/{script}/python_numpy/{script}.py {size}" \
                    .format(root=BP_ROOT, script=name, size=size)
        cmd_list.append(bp.command(bash_cmd, full_label))

# Finally, we build the Benchpress suite, which is written to `--output`
bp.create_suite(cmd_list)

And run the script:

$ python suites/simple_example.py --output my_benchmark.json
Scheduling 'X-ray/10*10*1': 'python xraysim/python_numpy/xraysim.py --size=10*10*1'
Scheduling 'X-ray/20*10*1': 'python xraysim/python_numpy/xraysim.py --size=20*10*1'
Scheduling 'Bean/10000*10': 'python galton_bean_machine/python_numpy/galton_bean_machine.py --size=10000*10'
Scheduling 'Bean/20000*10': 'python galton_bean_machine/python_numpy/galton_bean_machine.py --size=20000*10'
Writing suite file: my_benchmark.json

The result is a JSON file results.json that encapsulate the commands that make up the benchmark suite. Now, use bp-run to run the benchmark suite:

$bp-run results.json
Executing 'X-ray/10*10*1'
Executing 'X-ray/20*10*1'
Executing 'Bean/10000*10'
Executing 'Bean/20000*10'

Finally, let’s visualize the results in ASCII:

$bp-cli results.json
X-ray/10*10*1: [0.013303, 0.013324, 0.012933] 0.0132 (0.0002)
X-ray/20*10*1: [0.108884, 0.105319, 0.105392] 0.1065 (0.0017)
Bean/10000*10: [0.002653, 0.002553, 0.002616] 0.0026 (0.0000)
Bean/20000*10: [0.005149, 0.005088, 0.005271] 0.0052 (0.0001)

Or as a bar chart:

$bp-chart results.json --output results.pdf
Writing file 'results.pdf' using format 'pdf'.
https://raw.githubusercontent.com/bh107/benchpress/master/doc/source/_static/quickstart_results.png

Installation

Benchpress is distributed via PyPi and Github, which means that you can install it from a Python package (from PyPi) or directly from an unpacked tarball or git-clone from Github.com.

Note

Benchpress is designed to work, with minimal friction, in an environment where the user has limited system permissions. Such as shared computing environments, clusters and supercomputers. A system-wide installation of Benchpress is therefore untested. However, it should work if write permission is assigned to Benchpress users for the benchmarks folder.

From pypi.python.org

The following shows how to do a user-mode / local installation:

pip install benchpress --user

Extend your $PATH, such that the binaries (bp-run, bp-run, bp-cli, bp-chart) are readily available:

export PATH=$PATH:$HOME/.local/bin

When you are done using Benchpress, purging it from your system is as easy as:

pip uninstall benchpress

From clone or tarball

Clone the repos:

git clone https://github.com/bh107/benchpress.git
cd benchpress

or download and unpack a tarball:

wget https://github.com/bh107/benchpress/archive/master.zip
unzip master.zip
cd bohrium-benchpress

And install it:

pip install . --user

When developing Benchpress use PyPi’s developer installation:

pip install . –user -e

The Suite File

The suite file contains the set of commands to execute and the results of said executions. In Python, it is easy to create a suite file. You start by creating a list of dict where each dict specify a command to execute such as:

[{
  'label': 'X-ray/10*10*1', # The label of the command
  'cmd': 'python xraysim.py 10*10*1',  # The command to execute
  'env': {}  # The environment variables that will be defined before execution
}, {
  'label': 'X-ray/20*10*1',
  'cmd': 'python xraysim.py 20*10*1',
  'env': {}
}]

And then you call benchpress.benchpress.create_suite(), which writes the suite file at the location specified with the command line argument --output.

JSON schema

The suite format is defined in json-schema:

A example of a suite file that contains two commands; one finished and one pending:

{
    "creation_date_utc": "2017-05-23T09:02:25.373696",
    "cmd_list": [
        {
            "cmd": "python xraysim.py 10*10*1",
            "jobs": [
                {
                    "status": "finished",
                    "warmup": true,
                    "results": [
                        {
                            "stderr": "",
                            "success": true,
                            "stdout": "elapsed-time: 0.013244\n"
                        }
                    ],
                    "nruns": 1
                }
            ],
            "env": {},
            "label": "X-ray/10*10*1"
        },
        {
            "cmd": "python xraysim.py 20*10*1",
            "jobs": [
                {
                    "status": "pending",
                    "warmup": true,
                    "nruns": 1
                }
            ],
            "env": {},
            "label": "X-ray/20*10*1"
        }
    ]
}

Usage - Commands

Benchpress provides a set of binaries for running and visualize the suite file The Suite File.

bp-run

After you have created a benchmark suite file (see. The Suite File), you can then use bp-run to execute the suite:

usage: run.py [-h] [--nruns NRUNS] [--warmup] [--dirty] [--tag TAG] [--slurm]
              [--partition PARTITION] [--multi-jobs] [--nice NICE]
              suite

Runs a benchmark suite and stores the results in a JSON-file.

positional arguments:
  suite                 Path to the JSON file where the benchmark results will
                        be read and written. If the file exist, the benchmark
                        will resume.

optional arguments:
  -h, --help            show this help message and exit
  --nruns NRUNS         How many times should each command run.
  --warmup              Execute one warm up run, before the measured runs
  --dirty               Do no clean up.
  --tag TAG             Assign a tag to the result.

SLURM Queuing System:
  --slurm               Use the SLURM queuing system.
  --partition PARTITION
                        Submit to a specific SLURM partition.
  --multi-jobs          Submit 'nruns' SLURM jobs instead of one job with
                        'nruns' number of runs.
  --nice NICE           The scheduling priority - range is from -10000
                        (highest priority) to 10000 (lowest priority) where
                        zero is default. Only privileged users can specify a
                        negative priority.

bp-cli

At any point, you can read the result of the finished executions within a benchmark suite file. bp-cli prints the benchmark results to the terminal:

usage: cli.py [-h] [-o FILE] [--parse-regex RegEx] [--py-type {float,int,str}]
              [--labels-to-include RegEx] [--labels-to-exclude RegEx]
              [--label-map RegEx:label,...,RegEx:label] [--csv]
              [--csv-separator sep]
              FILE

Prints the result of a Benchpress JSON-file.

positional arguments:
  FILE                  JSON file containing results

optional arguments:
  -h, --help            show this help message and exit
  -o FILE, --output FILE
                        Write output to FILE. (default: None)
  --parse-regex RegEx   How to parse the result of each run. For each RegEx
                        match, group one is recorded as a result. (default:
                        elapsed-time: ([\d.]+))
  --py-type {float,int,str}
                        The Python data type of the parsed results. (default:
                        float)
  --labels-to-include RegEx
                        All labels that match the RegEx are showed. (default:
                        .*)
  --labels-to-exclude RegEx
                        All labels that match the RegEx are ignored. (default:
                        None)
  --label-map RegEx:label,...,RegEx:label
                        Comma separated list of original-to-new-label names
                        (default: None)
  --csv                 Use the CSV format using 'separator' as the separator.
                        (default: False)
  --csv-separator sep   Use the CSV format using 'sep' as the separator.
                        (default: ,)

bp-cli-series

usage: cli_series.py [-h] [-o FILE] [--parse-regex RegEx]
                     [--py-type {float,int,str}] [--labels-to-include RegEx]
                     [--labels-to-exclude RegEx]
                     [--label-map RegEx:label,...,RegEx:label]
                     [--csv-separator sep]
                     [--meta-key {creation_date_utc,tag}]
                     FILE_LIST [FILE_LIST ...]

Prints the result of a series of Benchpress JSON-files.

positional arguments:
  FILE_LIST             JSON file containing results (accept multiple files)

optional arguments:
  -h, --help            show this help message and exit
  -o FILE, --output FILE
                        Write output to FILE. (default: None)
  --parse-regex RegEx   How to parse the result of each run. For each RegEx
                        match, group one is recorded as a result. (default:
                        elapsed-time: ([\d.]+))
  --py-type {float,int,str}
                        The Python data type of the parsed results. (default:
                        float)
  --labels-to-include RegEx
                        All labels that match the RegEx are showed. (default:
                        .*)
  --labels-to-exclude RegEx
                        All labels that match the RegEx are ignored. (default:
                        None)
  --label-map RegEx:label,...,RegEx:label
                        Comma separated list of original-to-new-label names
                        (default: None)
  --csv-separator sep   Use the CSV format using 'sep' as the separator.
                        (default: ,)
  --meta-key {creation_date_utc,tag}
                        The meta key, which value vary throughout the series
                        (default: creation_date_utc)

bp-chart

usage: bar_per_cmd.py [-h] [-o FILE] [--parse-regex RegEx]
                      [--py-type {float,int,str}] [--labels-to-include RegEx]
                      [--labels-to-exclude RegEx]
                      [--label-map RegEx:label,...,RegEx:label]
                      [--title TITLE] [--warmups WARMUPS] [--ymax YMAX]
                      [--ymin YMIN] [--ylog] [--ylabel YLABEL]
                      [--xticklabel-rotation DEGREE] [--no-legend]
                      [--fontsize FONTSIZE]
                      [--pyplot-style {seaborn-darkgrid,Solarize_Light2,seaborn-notebook,classic,seaborn-ticks,grayscale,bmh,seaborn-talk,dark_background,ggplot,fivethirtyeight,_classic_test,seaborn-colorblind,seaborn-deep,seaborn-whitegrid,seaborn,seaborn-poster,seaborn-bright,seaborn-muted,seaborn-paper,seaborn-white,fast,seaborn-pastel,seaborn-dark,tableau-colorblind10,seaborn-dark-palette}]
                      [--mpld3]
                      FILE

Plots the result of a Benchpress JSON-file (one bar per command)

positional arguments:
  FILE                  JSON file containing results

optional arguments:
  -h, --help            show this help message and exit
  -o FILE, --output FILE
                        Write output to FILE. (default: None)
  --parse-regex RegEx   How to parse the result of each run. For each RegEx
                        match, group one is recorded as a result. (default:
                        elapsed-time: ([\d.]+))
  --py-type {float,int,str}
                        The Python data type of the parsed results. (default:
                        float)
  --labels-to-include RegEx
                        All labels that match the RegEx are showed. (default:
                        .*)
  --labels-to-exclude RegEx
                        All labels that match the RegEx are ignored. (default:
                        None)
  --label-map RegEx:label,...,RegEx:label
                        Comma separated list of original-to-new-label names
                        (default: None)
  --title TITLE         The title of the chart. (default: None)
  --warmups WARMUPS     Specify the amount of samples from warm-up rounds.
                        (default: 0)
  --ymax YMAX           Max value of the y-axis (default: None)
  --ymin YMIN           Min value of the y-axis (default: None)
  --ylog                Makes the y-axis logarithmic (default: False)
  --ylabel YLABEL       Label on the y-axis (default: Elapsed time in seconds)
  --xticklabel-rotation DEGREE
                        The rotation of the labels on the x-axis (default: 90)
  --no-legend           Hide the legend box (default: False)
  --fontsize FONTSIZE   Fontsize (default: 12)
  --pyplot-style {seaborn-darkgrid,Solarize_Light2,seaborn-notebook,classic,seaborn-ticks,grayscale,bmh,seaborn-talk,dark_background,ggplot,fivethirtyeight,_classic_test,seaborn-colorblind,seaborn-deep,seaborn-whitegrid,seaborn,seaborn-poster,seaborn-bright,seaborn-muted,seaborn-paper,seaborn-white,fast,seaborn-pastel,seaborn-dark,tableau-colorblind10,seaborn-dark-palette}
                        The matplotlib.pyplot.style to use (default: seaborn-
                        darkgrid)
  --mpld3               Generate mpld3 plots <https://mpld3.github.io>
                        (default: False)

bp-chart-series

usage: series_per_cmd.py [-h] [-o FILE] [--parse-regex RegEx]
                         [--py-type {float,int,str}]
                         [--labels-to-include RegEx]
                         [--labels-to-exclude RegEx]
                         [--label-map RegEx:label,...,RegEx:label]
                         [--title TITLE] [--warmups WARMUPS] [--ymax YMAX]
                         [--ymin YMIN] [--ylog] [--ylabel YLABEL]
                         [--plot-size PLOT_SIZE]
                         [--xticklabel-rotation DEGREE] [--no-legend]
                         [--fontsize FONTSIZE] [--pyplot-style PYPLOT_STYLE]
                         [--meta-key {creation_date_utc,tag}] [--mpld3]
                         FILE_LIST [FILE_LIST ...]

Plots the result of a Benchpress JSON-file (one bar per command)

positional arguments:
  FILE_LIST             JSON file containing results (accept multiple files)

optional arguments:
  -h, --help            show this help message and exit
  -o FILE, --output FILE
                        Write output to FILE. (default: None)
  --parse-regex RegEx   How to parse the result of each run. For each RegEx
                        match, group one is recorded as a result. (default:
                        elapsed-time: ([\d.]+))
  --py-type {float,int,str}
                        The Python data type of the parsed results. (default:
                        float)
  --labels-to-include RegEx
                        All labels that match the RegEx are showed. (default:
                        .*)
  --labels-to-exclude RegEx
                        All labels that match the RegEx are ignored. (default:
                        None)
  --label-map RegEx:label,...,RegEx:label
                        Comma separated list of original-to-new-label names
                        (default: None)
  --title TITLE         The title of the chart. (default: None)
  --warmups WARMUPS     Specify the amount of samples from warm-up rounds.
                        (default: 0)
  --ymax YMAX           Max value of the y-axis (default: None)
  --ymin YMIN           Min value of the y-axis (default: None)
  --ylog                Makes the y-axis logarithmic (default: False)
  --ylabel YLABEL       Label on the y-axis (default: Elapsed time in seconds)
  --plot-size PLOT_SIZE
                        Size of the plot compared to total figure (0.1 to 1.0)
                        (default: 0.7)
  --xticklabel-rotation DEGREE
                        The rotation of the labels on the x-axis (default: 90)
  --no-legend           Hide the legend box (default: False)
  --fontsize FONTSIZE   Fontsize (default: 12)
  --pyplot-style PYPLOT_STYLE
                        The matplotlib.pyplot.style to use (default: seaborn-
                        darkgrid)
  --meta-key {creation_date_utc,tag}
                        The meta key, which value vary throughout the series
                        (default: creation_date_utc)
  --mpld3               Generate mpld3 plot <https://mpld3.github.io>
                        (default: False)

bp-raw

usage: raw.py [-h] [-o FILE] [--parse-regex RegEx] [--py-type {float,int,str}]
              [--labels-to-include RegEx] [--labels-to-exclude RegEx]
              [--label-map RegEx:label,...,RegEx:label] [--no-stdout]
              [--no-stderr]
              FILE

Raw dump of the stdout and stderr in a Benchpress JSON-file.

positional arguments:
  FILE                  JSON file containing results

optional arguments:
  -h, --help            show this help message and exit
  -o FILE, --output FILE
                        Write output to FILE. (default: None)
  --parse-regex RegEx   How to parse the result of each run. For each RegEx
                        match, group one is recorded as a result. (default:
                        elapsed-time: ([\d.]+))
  --py-type {float,int,str}
                        The Python data type of the parsed results. (default:
                        float)
  --labels-to-include RegEx
                        All labels that match the RegEx are showed. (default:
                        .*)
  --labels-to-exclude RegEx
                        All labels that match the RegEx are ignored. (default:
                        None)
  --label-map RegEx:label,...,RegEx:label
                        Comma separated list of original-to-new-label names
                        (default: None)
  --no-stdout           Don't dump stdout. (default: False)
  --no-stderr           Don't dump stderr. (default: False)

Reference

Release:3.0.5.post10
Date:Feb 19, 2019

Benchpress: a benchmark tool and collection

Benchpress is primarily a tool for running benchmarks and analyze/visualize the result.

The workflow:
  • Write a Python program that generates the commands to run and write the commands to a JSON file
  • Use bp-run to run the JSON file
  • Use a visualizer such as bp-cli or bp-chart to visualize the results within the JSON file

Benchpress

benchpress.benchpress.command(cmd, label, env={})[source]

Create a Benchpress command, which define a single benchmark execution

This is a help function to create a Benchpress command, which is a Python dict of the parameters given.

Parameters:
  • cmd (str) – The bash string that makes up the command
  • label (str) – The human readable label of the command
  • env (dict) – The Python dictionary of environment variables to define before execution’
Returns:

command – The created Benchpress command

Return type:

dict

benchpress.benchpress.create_suite(cmd_list, output_path=None)[source]

Create a suite file (JSON) based on a list of commands

Parameters:
  • cmd_list (list of dict) – List of commands that makes up this benchmark suite.
  • output_path (str) – Path to the output file when the –output argument is unset. If None, a path to a temporary file is used.

See also

command()
Help function to create a new command

Utilities for Generating Commands

benchpress.suite_util.benchmark_path(name, implementation, extension)[source]

Returns the path to the executable of a benchmark implementation in the include suites.

Parameters:
  • name (str) – The name of the benchmark e.g. ‘montecarlo_pi’
  • implementation (str) – The name of the implementation e.g. ‘python_numpy’
  • extension (str) – The extension of the executable e.g. ‘.py’
Returns:

path – Absolute path to the executable

Return type:

str

Suite Examples

NAS Parallel Benchmarks by NASA

This script generates a Benchpress suite file that will run the NAS benchmark suite <https://www.nas.nasa.gov/publications/npb.html>. It download and compile the NAS benchmark based on the configuration specified with ‘–nas-config’

Example

Use the GCC config:

python nas_parallel_benchmark.py --nas-config "make.def.gcc_x86" --output nas_suite.json

And then run the suite:

bp-run nas_suite.json

Finally, check the result:

# Wall clock timings
bp-cli nas_suite.json --parse-regex "Time in seconds\s*=(.+)"

# Mop/s
bp-cli nas_suite.json --parse-regex "Mop/s total\s*=(.+)"

The Source:

"""
===============================
NAS Parallel Benchmarks by NASA
===============================

This script generates a Benchpress suite file that will run the NAS benchmark suite 
<https://www.nas.nasa.gov/publications/npb.html>.
It download and compile the NAS benchmark based on the configuration specified with '--nas-config'

Example
-------

Use the GCC config::
    
    python nas_parallel_benchmark.py --nas-config "make.def.gcc_x86" --output nas_suite.json

And then *run* the suite::

    bp-run nas_suite.json
    
Finally, check the result:: 

    # Wall clock timings
    bp-cli nas_suite.json --parse-regex "Time in seconds\s*=(.+)"
    
    # Mop/s
    bp-cli nas_suite.json --parse-regex "Mop/s total\s*=(.+)"

"""
import urllib
import tarfile
import tempfile
import subprocess
import os
from os.path import join
import benchpress as bp
from benchpress.argument_handling import args, add_argument

if __name__ == "__main__":

    # Adding arguments that are specific to this Benchpress suite
    add_argument(
        '--nas-config',
        default="make.def.gcc_x86",
        help="The name of the NAS config to use."
    )

    # Let's download and unpack the NAS benchmark
    tdir = tempfile.mkdtemp()
    nas_filename = join(tdir, "NAS.tgz")
    print ("Download NAS benchmark to '%s'" % nas_filename)
    urllib.urlretrieve("https://www.nas.nasa.gov/assets/npb/NPB3.3.1.tar.gz", nas_filename)
    tar = tarfile.open(nas_filename)
    tar.extractall(path=tdir)
    tar.close()

    # Choose NAS config
    print ("Using NAS config '%s'" % args().nas_config)
    wdir = join(tdir, "NPB3.3.1/NPB3.3-OMP")
    os.rename("%s/config/NAS.samples/%s" % (wdir, args().nas_config), "%s/config/make.def" % wdir)

    benchmarks = ['bt', 'cg', 'dc', 'ft', 'is', 'lu', 'mg', 'sp', 'ua']
    size = "W"

    for benchmark in benchmarks:
        subprocess.check_call(["make %s CLASS=%s" % (benchmark, size)], shell=True, cwd=wdir)
    print ("BUILD FINISHED\n\n")

    cmd_list = []
    for label in benchmarks:
        bash_cmd = "{root}/bin/{label}.{size}.x".format(root=wdir, label=label, size=size)
        cmd_list.append(bp.command(bash_cmd, "%s/%s/OMP1" % (label, size), env={'OMP_NUM_THREADS': 1}))

    bp.create_suite(cmd_list)

Benchmark Examples

Benchpress includes a collection of benchmarks:

Benchmarks

21 Benchmarks Python C C++ C#
  Cython Numba Numpy Omp Omp Mpi Seq Armadillo Blitz Boost Eigen Omp Opencl Seq Numcil
Black Scholes     * *   * * *   * [ISU] *   * *
Convolve     *                      
Convolve1D     *                      
Galton Bean Machine     *                      
Gameoflife     *     * [ISU]         * [ISU]      
Gauss     *                      
Heat Equation     * * * *         * * [ISU]   *
Lattice Boltzmann D2Q9     * [IBNP]                      
Leibnitz Pi     *     *         *   *  
Lu     *                      
Magnetic Field Extrapolation * * * [IBNP]                      
Montecarlo Pi     * *   *         *   *  
Nbody     *                     *
Nbody Nice     * [ISU]                     *
  Cython Numba Numpy Omp Omp Mpi Seq Armadillo Blitz Boost Eigen Omp Opencl Seq Numcil
Quasicrystal     *                      
Reactiondiffusion                           *
Rosenbrock     *     *         *   *  
Shallow Water     *     *     *   *   * *
Snakes And Ladders     * [ISU]                      
Wisp     *                      
Xraysim     *                      
[ISU](1, 2, 3, 4, 5, 6) The implementation has issues… such as not using of Benchpress, segfaults, or does not run with Bohrium.
[BH]The implementation makes use of Bohrium specific features, which means that Bohrium is required to run it.
[IBNP](1, 2) The implementation does import bohrium as np, which breaks the Bohrium dogma “High-Performance NumPy without changing a single line of code.

Black Scholes

Python Numpy

from __future__ import print_function
from benchpress.benchmarks import util
import numpy as np
import math

try:
    import numpy_force as npf
except ImportError:
    import numpy as npf

bench = util.Benchmark("Black Scholes", "samples*iterations")


def model(N):
    """Construct pseudo-data representing price samples?"""
    # Create the outside of bohrium
    S = npf.random.random(N).astype(bench.dtype) * 4.0 + 58.0  # Price is between 58-62
    S = np.array(S)
    return S


def CND(X):
    """
    Cumulative normal distribution.
    Helper function used by BS(...).
    """

    (a1, a2, a3, a4, a5) = (0.31938153, -0.356563782, 1.781477937, -1.821255978, 1.330274429)
    L = np.absolute(X)
    K = 1.0 / (1.0 + 0.2316419 * L)
    w = 1.0 - 1.0 / math.sqrt(2 * np.pi) * np.exp(-L * L / 2.) * \
        (a1 * K + \
         a2 * (K ** 2) + \
         a3 * (K ** 3) + \
         a4 * (K ** 4) + \
         a5 * (K ** 5))

    mask = X < 0
    w = w * ~mask + (1.0 - w) * mask

    return w


def BS(CallPutFlag, S, X, T, r, v):
    """Black Sholes Function."""

    d1 = (np.log(S / X) + (r + v * v / 2.) * T) / (v * math.sqrt(T))
    d2 = d1 - v * math.sqrt(T)
    if CallPutFlag == 'c':
        return S * CND(d1) - X * math.exp(-r * T) * CND(d2)
    else:
        return X * math.exp(-r * T) * CND(-d2) - S * CND(-d1)


def price(S, I, flag='c', X=65.0, dT=(1.0 / 365.0), r=0.08, v=0.3):
    """Model price as a function of time."""

    T = dT
    Ps = []
    N = len(S)
    for i in range(I):
        P = np.sum(BS(flag, S, X, T, r, v)) / N
        Ps.append(P)
        T += dT
        bench.flush()

    return Ps


def main():
    N = bench.args.size[0]
    I = bench.args.size[1]

    S = model(N)  # Construct pseudo-data\

    bench.start()
    R = price(S, I)  # Run the model
    bench.stop()
    bench.pprint()


if __name__ == "__main__":
    main()

C99 Omp

#include <inttypes.h>
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <Random123/philox.h>
#include <bp_util.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

typedef union philox2x32_as_1x64 {
    philox2x32_ctr_t orig;
    uint64_t combined;
} philox2x32_as_1x64_t;

double* model(int64_t samples)
{
    const uint64_t key = 0;
    double* data = (double*)malloc(sizeof(double)*samples);

    #pragma omp parallel for
    for(int64_t count = 0; count<samples; ++count) {
        philox2x32_as_1x64_t counter;
        counter.combined = count;

        uint64_t x_raw = ((philox2x32_as_1x64_t)philox2x32(
          counter.orig,
          (philox2x32_key_t){ { key } }
        )).combined;
        double x = x_raw;
        x /= 18446744073709551616.000000;
        x *= 4.0;
        x += 58.0;          // Model between 58-62

        data[count] = x;
    }
    return data;
}

// The cumulative normal distribution function 
double cnd( double x )
{
    double L, K, w;

    double const a1 =  0.31938153,
                 a2 = -0.356563782,
                 a3 =  1.781477937,
                 a4 = -1.821255978,
                 a5 =  1.330274429;

    L = fabs(x);
    K = 1.0 / (1.0 + 0.2316419 * L);
    w = 1.0 - 1.0 / sqrt(2 * M_PI) * exp(-L *L / 2) * (\
        a1 * K         + \
        a2 * pow(K, 2) + \
        a3 * pow(K, 3) + \
        a4 * pow(K, 4) + \
        a5 * pow(K, 5)
    );

    return (x<0) ? 1.0 - w : w;
}

// The Black and Scholes (1973) Stock option formula
void pricing(double* market, double *prices,
             size_t samples, size_t iterations,
             char flag, double x, double d_t, double r, double v)
{
    double t = d_t;

    for(size_t iter=0; iter<iterations; ++iter) {
        double res = 0;
        #pragma omp parallel for reduction(+:res)
        for(size_t sample=0; sample<samples; ++sample) {
            double d1 = (log(market[sample]/x) + (r+v*v/2)*t) / (v*sqrt(t));
            double d2 = d1-v*sqrt(t);

            if (flag == 'c') {
                res += market[sample] *cnd(d1)-x * exp(-r*t)*cnd(d2);
            } else {
                res += x * exp(-r*t) * cnd(-d2) - market[sample] * cnd(-d1);
            }
        }
        t += d_t;
        prices[iter] = res / samples;
    }
}

int main(int argc, char* argv[])
{
    bp_util_type bp = bp_util_create(argc, argv, 2);
    if (bp.args.has_error) {
        return 0;
    }
    const int samples    = bp.args.sizes[0];
    const int iterations = bp.args.sizes[1];

    double* market = model(samples);        // Generate pseudo-market data
    double* prices = (double*)malloc(sizeof(double)*iterations); // Prices

    bp.timer_start();
    pricing(
        market, prices,
        samples, iterations,
        'c', 65.0, 1.0 / 365.0,
        0.08, 0.3
    );
    bp.timer_stop();
    
    bp.print("black_scholes(c99_omp)");
    if (bp.args.verbose) {                 // and values.
        printf("output: [ ");
        for(int i=0; i<iterations; ++i) {
            printf("%f ", prices[i]);
        }
        printf("]\n");
    }

    free(market);
    free(prices);
    return 0;
}

C99 Seq

#include <inttypes.h>
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <Random123/philox.h>
#include <bp_util.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

typedef union philox2x32_as_1x64 {
    philox2x32_ctr_t orig;
    uint64_t combined;
} philox2x32_as_1x64_t;

double* model(int64_t samples)
{
    const uint64_t key = 0;
    double* data = (double*)malloc(sizeof(double)*samples);

    for(int64_t count = 0; count<samples; ++count) {
        philox2x32_as_1x64_t counter;
        counter.combined = count;

        uint64_t x_raw = ((philox2x32_as_1x64_t)philox2x32(
          counter.orig,
          (philox2x32_key_t){ { key } }
        )).combined;
        double x = x_raw;
        x /= 18446744073709551616.000000;
        x *= 4.0;
        x += 58.0;          // Model between 58-62

        data[count] = x;
    }
    return data;
}

// The cumulative normal distribution function 
double cnd( double x )
{
    double L, K, w;

    double const a1 =  0.31938153,
                 a2 = -0.356563782,
                 a3 =  1.781477937,
                 a4 = -1.821255978,
                 a5 =  1.330274429;

    L = fabs(x);
    K = 1.0 / (1.0 + 0.2316419 * L);
    double K2 = K*K;
    double K4 = K2*K2;
    w = 1.0 - 1.0 / sqrt(2 * M_PI) * exp(-L *L / 2) * (\
        a1 * K         + \
        a2 * K2 + \
        a3 * K*K2 + \
        a4 * K4 + \
        a5 * K*K4
    );

    return (x<0) ? 1.0 - w : w;
}

// The Black and Scholes (1973) Stock option formula
void pricing(double* market, double *prices,
             size_t samples, size_t iterations,
             char flag, double x, double d_t, double r, double v)
{
    double t = d_t;

    for(size_t iter=0; iter<iterations; ++iter) {
        double res = 0;
        for(size_t sample=0; sample<samples; ++sample) {
            double d1 = (log(market[sample]/x) + (r+v*v/2)*t) / (v*sqrt(t));
            double d2 = d1-v*sqrt(t);

            if (flag == 'c') {
                res += market[sample] *cnd(d1)-x * exp(-r*t)*cnd(d2);
            } else {
                res += x * exp(-r*t) * cnd(-d2) - market[sample] * cnd(-d1);
            }
        }
        t += d_t;
        prices[iter] = res / samples;
    }
}

int main(int argc, char* argv[])
{
    bp_util_type bp = bp_util_create(argc, argv, 2);
    if (bp.args.has_error) {
        return 0;
    }
    const int samples    = bp.args.sizes[0];
    const int iterations = bp.args.sizes[1];

    double* market = model(samples);        // Generate pseudo-market data
    double* prices = (double*)malloc(sizeof(double)*iterations); // Prices

    bp.timer_start();
    pricing(
        market, prices,
        samples, iterations,
        'c', 65.0, 1.0 / 365.0,
        0.08, 0.3
    );
    bp.timer_stop();
    
    bp.print("black_scholes(c99_seq)");
    if (bp.args.verbose) {                 // and values.
        printf("output: [ ");
        for(int i=0; i<iterations; ++i) {
            printf("%f ", prices[i]);
        }
        printf("]\n");
    }

    free(market);
    free(prices);
    return 0;
}

Cpp11 Armadillo

/*
This file is part of Bohrium and Copyright (c) 2012 the Bohrium team:
http://bohrium.bitbucket.org

Bohrium is free software: you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as 
published by the Free Software Foundation, either version 3 
of the License, or (at your option) any later version.

Bohrium is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the 
GNU Lesser General Public License along with bohrium. 

If not, see <http://www.gnu.org/licenses/>.
*/
#include <iostream>
#include <armadillo>
#include <bp_util.h>

using namespace std;
using namespace arma;

template <typename T>
Col<T> cnd(Col<T> x)
{
    size_t samples = x.n_elem;
    Col<T> l(samples), k(samples), w(samples);
    T a1 = 0.31938153,
      a2 =-0.356563782,
      a3 = 1.781477937,
      a4 =-1.821255978,
      a5 = 1.330274429,
      pp = 2.5066282746310002; // sqrt(2.0*PI)

    l = abs(x);
    k = 1.0 / (1.0 + 0.2316419 * l);

    w = 1.0 - 1.0 / pp * exp(-1.0*l%l/2.0) % \
        (a1*k + \
         a2*(pow(k,(T)2)) + \
         a3*(pow(k,(T)3)) + \
         a4*(pow(k,(T)4)) + \
         a5*(pow(k,(T)5)));

    uvec mask = x < 0.0;

    return w % (-mask) + (1.0-w) % mask;
}

template <typename T>
T* pricing(size_t samples, size_t iterations, char flag, T x, T d_t, T r, T v)
{
    T* p    = (T*)malloc(sizeof(T)*samples);    // Intermediate results
    T t     = d_t;                              // Initial delta

    Col<T> d1(samples), d2(samples), res(samples);
    Col<T> s = randu<Col<T> >(samples)*4.0 +58.0;      // Model between 58-62

    for(size_t i=0; i<iterations; i++) {
        d1 = (log(s/x) + (r+v*v/2.0)*t) / (v*sqrt(t));
        d2 = d1-v*sqrt(t);
        if (flag == 'c') {
            res = s % cnd<T>(d1) -x * exp(-r*t) * cnd<T>(d2);
        } else {
            res = x * exp(-r*t) * cnd<T>(-1.0*d2) - s*cnd<T>(-1.0*d1);
        }

        t += d_t;                               // Increment delta
        p[i] = sum(res) / (T)samples;           // Result from timestep
    }

    return p;
}

int main(int argc, char* argv[])
{
    bp_util_type bp = bp_util_create(argc, argv, 2);
    if (bp.args.has_error) {
        return 1;
    }
    const size_t samples    = bp.args.sizes[0];
    const size_t iterations = bp.args.sizes[1];

    bp.timer_start();
    double* prices = pricing(
        samples, iterations,
        'c', 65.0, 1.0 / 365.0,
        0.08, 0.3
    );
    bp.timer_stop();

    bp.print("black_scholes(cpp11_armadillo)");
    if (bp.args.verbose) {
        cout << ", \"output\": [";
        for(size_t i=0; i<iterations; i++) {
            cout << prices[i];
            if (iterations-1!=i) {
                cout << ", ";
            }
        }
        cout << "]" << endl;
    }

    free(prices);
    return 0;
}

Cpp11 Blitz

#include <blitz/array.h>
#include <random/uniform.h>
#include <bp_util.h>

using namespace blitz;
using namespace ranlib;

template <typename T>
Array<T, 1> cnd(Array<T, 1> & x)
{
    int samples = x.numElements();
    Array<T, 1> l(samples), k(samples), w(samples), res(samples);
    Array<bool, 1> mask(samples);
    T a1 = 0.31938153,
      a2 =-0.356563782,
      a3 = 1.781477937,
      a4 =-1.821255978,
      a5 = 1.330274429,
      pp = 2.5066282746310002; // sqrt(2.0*PI)

    l = abs(x);
    k = 1.0 / (1.0 + 0.2316419 * l);
    w = 1.0 - 1.0 / pp * exp(-1.0*l*l/2.0) * \
        (a1*k + \
         a2*(pow(k,(T)2)) + \
         a3*(pow(k,(T)3)) + \
         a4*(pow(k,(T)4)) + \
         a5*(pow(k,(T)5)));

    mask    = x < 0.0;
    res     = (w * cast<T>(!mask) + (1.0-w)* cast<T>(mask));
    return res;
}

template <typename T>
T* pricing(size_t samples, size_t iterations, char flag, T x, T d_t, T r, T v)
{
    Array<T, 1> d1(samples), d2(samples), res(samples);
    T* p    = (T*)malloc(sizeof(T)*samples);    // Intermediate results
    T t     = d_t;                              // Initial delta

    Array<T, 1> s(samples);                     // Initial uniform sampling
    Uniform<T> rand;                            // values between 58-62
    rand.seed((unsigned int)time(0));
    s = rand.random() *4.0 +58.0;

    for(size_t i=0; i<iterations; i++) {
        d1 = (log(s/x) + (r+v*v/2.0)*t) / (v*sqrt(t));
        d2 = d1-v*sqrt(t);
        if (flag == 'c') {
            res = s * cnd(d1) - x * exp(-r * t) * cnd(d2);
        } else {
            Array<T, 1> tmp1(samples), tmp2(samples);
            tmp1 = -1.0*d2;
            tmp2 = -1.0*d1;

            res = x * exp(-r*t) * cnd(tmp1) - s*cnd(tmp2);
        }
        t += d_t;                               // Increment delta
        p[i] = sum(res) / (T)samples;           // Result from timestep
    }

    return p;
}

//FLOP count: 2*s+i*(s*8+2*s*23) where s is samples and i is iterations

int main(int argc, char* argv[])
{
    bp_util_type bp = bp_util_create(argc, argv, 2);
    if (bp.args.has_error) {
        return 1;
    }
    const size_t samples    = bp.args.sizes[0];
    const size_t iterations = bp.args.sizes[1];

    bp.timer_start();
    double* prices = pricing(
        samples, iterations,
        'c', 65.0, 1.0 / 365.0,
        0.08, 0.3
    );
    bp.timer_stop();
    
    bp.print("black_scholes(cpp11_blitz)");
    if (bp.args.verbose) {
        cout << ", \"output\": [";
        for(size_t i=0; i<iterations; i++) {
            cout << prices[i];
            if (iterations-1!=i) {
                cout << ", ";
            }
        }
        cout << "]" << endl;
    }

    free(prices);
    return 0;
}

Cpp11 Eigen

Error

There are issues with the implementation.

Compilation errors.

/*
This file is part of Bohrium and Copyright (c) 2012 the Bohrium team:
http://bohrium.bitbucket.org

Bohrium is free software: you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as 
published by the Free Software Foundation, either version 3 
of the License, or (at your option) any later version.

Bohrium is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the 
GNU Lesser General Public License along with bohrium. 

If not, see <http://www.gnu.org/licenses/>.
*/
#include <iostream>
#include <Eigen/Dense>
#include <bp_util.h>

using namespace std;
using namespace Eigen;

template <typename T>
Array<T, Dynamic, 1>& cnd(const Array<T, Dynamic, 1>& x)
{
    size_t samples = x.size();
    Array<T, Dynamic, 1> l(samples), k(samples), w(samples);
    Array<T, Dynamic, 1> mask(samples);
    Array<bool, Dynamic, 1> mask_bool(samples);

    T a1 = 0.31938153,
      a2 =-0.356563782,
      a3 = 1.781477937,
      a4 =-1.821255978,
      a5 = 1.330274429,
      pp = 2.5066282746310002; // sqrt(2.0*PI)

    l = abs(x);
    //k = (T)1.0 / ((T)1.0 + (T)0.2316419 * l);
    //k = (T)0.5 / l / (T)0.5;
    k = l / (T)0.5;
    w = (T)1.0 - (T)1.0 / (T)pp * exp((T)-(T)1.0*l*l/(T)2.0) * \
        ((T)a1*k + \
         (T)a2*(pow(k,(T)2)) + \
         (T)a3*(pow(k,(T)3)) + \
         (T)a4*(pow(k,(T)4)) + \
         (T)a5*(pow(k,(T)5)));

    mask_bool= (x < (T)0.0);
    mask = mask_bool.cast<T>();
    return w * (mask) + ((T)1.0-w);
    //return w * (mask) + ((T)1.0-w)* mask;
    //return w * (!mask) + (1.0-w)* mask;
}

template <typename T>
T* pricing(size_t samples, size_t iterations, char flag, T x, T d_t, T r, T v)
{
    T* p    = (T*)malloc(sizeof(T)*samples);    // Intermediate results
    T t     = d_t;                              // Initial delta

    Array<T, Dynamic, 1> d1(samples), d2(samples), res(samples);
    Array<T, Dynamic, 1> s = Array<T, Dynamic, 1>::Random(samples)*4.0 +58.0;      // Model between 58-62

    for(size_t i=0; i<iterations; i++) {
        d1 = (log(s/x) + (r+v*v/2.0)*t) / (v*sqrt(t));
        d2 = d1-v*sqrt(t);

        if (flag == 'c') {

            size_t samples = x.size();
            Array<T, Dynamic, 1> l(samples), k(samples), w(samples);
            Array<T, Dynamic, 1> mask(samples);
            Array<bool, Dynamic, 1> mask_bool(samples);

            T a1 = 0.31938153,
              a2 =-0.356563782,
              a3 = 1.781477937,
              a4 =-1.821255978,
              a5 = 1.330274429,
              pp = 2.5066282746310002; // sqrt(2.0*PI)

            l = abs(x);
            //k = (T)1.0 / ((T)1.0 + (T)0.2316419 * l);
            //k = (T)0.5 / l / (T)0.5;
            k = l / (T)0.5;
            w = (T)1.0 - (T)1.0 / (T)pp * exp((T)-(T)1.0*l*l/(T)2.0) * \
                ((T)a1*k + \
                 (T)a2*(pow(k,(T)2)) + \
                 (T)a3*(pow(k,(T)3)) + \
                 (T)a4*(pow(k,(T)4)) + \
                 (T)a5*(pow(k,(T)5)));

            mask_bool= (x < (T)0.0);
            mask = mask_bool.cast<T>();
            res =  w * (!mask) + (1.0-w)* mask;

            //cnd<T>(d1);
            //res = s * cnd<T>(d1) -x * exp(-r*t) * cnd<T>(d2);
        } else {
/*
            Array<T, Dynamic, 1> tmp1(samples), tmp2(samples);
            tmp1 = -1.0*d2;
            tmp2 = -1.0*d1;
            res = x * exp(-r*t) * cnd<T>(tmp1) - s*cnd<T>(tmp2);
    */
        }

        t += d_t;                               // Increment delta
        p[i] = res.sum() / (T)samples;           // Result from timestep
    }

    return p;
}

int main(int argc, char* argv[])
{
    bp_util_type bp = bp_util_create(argc, argv, 2);    // Grab arguments
    if (bp.args.has_error) {
        return 1;
    }
    const size_t samples    = bp.args.sizes[0];
    const size_t iterations = bp.args.sizes[1];

    bp.timer_start();                                   // Start timer
    double* prices = pricing(                           // Run...
        samples, iterations,
        'c', 65.0, 1.0 / 365.0,
        0.08, 0.3
    );
    bp.timer_stop();                                    // Stop timer
    
    bp.print("black_scholes(cpp11_bxx)");               // Print restults..
    if (bp.args.verbose) {                              // ..verbosely.
        cout << ", \"output\": [";
        for(size_t i=0; i<iterations; i++) {
            cout << prices[i];
            if (iterations-1!=i) {
                cout << ", ";
            }
        }
        cout << "]" << endl;
    }
    free(prices);                                       // Cleanup

    return 0;
}

Cpp11 Omp

#include <iostream>
#include <sstream>
#include <string>
#include <cmath>
#include <Random123/philox.h>
#include <bp_util.h>

using namespace std;

typedef union philox2x32_as_1x64 {
    philox2x32_ctr_t orig;
    uint64_t combined;
} philox2x32_as_1x64_t;

double* model(int64_t samples)
{
    const uint64_t key = 0;
    double* data = (double*)malloc(sizeof(double)*samples);

    #pragma omp parallel for
    for(int64_t count = 0; count<samples; ++count) {
        philox2x32_as_1x64_t counter;
        counter.combined = count;

        philox2x32_as_1x64_t x_philox;

        x_philox.orig = philox2x32(
          counter.orig,
          (philox2x32_key_t){ { key } } 
        );

        double x = x_philox.combined;
        x /= 18446744073709551616.000000;
        x *= 4.0;
        x += 58.0;          // Model between 58-62

        data[count] = x;
    }
    return data;
}

// The cumulative normal distribution function 
double cnd( double x )
{
    double L, K, w;

    double const a1 =  0.31938153,
                 a2 = -0.356563782,
                 a3 =  1.781477937,
                 a4 = -1.821255978,
                 a5 =  1.330274429;

    L = fabs(x);
    K = 1.0 / (1.0 + 0.2316419 * L);
    w = 1.0 - 1.0 / sqrt(2 * M_PI) * exp(-L *L / 2) * (\
        a1 * K         + \
        a2 * pow(K, 2) + \
        a3 * pow(K, 3) + \
        a4 * pow(K, 4) + \
        a5 * pow(K, 5)
    );

    return (x<0) ? 1.0 - w : w;
}

// The Black and Scholes (1973) Stock option formula
void pricing(double* market, double *prices,
             size_t samples, size_t iterations,
             char flag, double x, double d_t, double r, double v)
{
    double t = d_t;

    for(size_t iter=0; iter<iterations; ++iter) {
        double res = 0;
        #pragma omp parallel for reduction(+:res)
        for(size_t sample=0; sample<samples; ++sample) {
            double d1 = (log(market[sample]/x) + (r+v*v/2)*t) / (v*sqrt(t));
            double d2 = d1-v*sqrt(t);

            if (flag == 'c') {
                res += market[sample] *cnd(d1)-x * exp(-r*t)*cnd(d2);
            } else {
                res += x * exp(-r*t) * cnd(-d2) - market[sample] * cnd(-d1);
            }
        }
        t += d_t;
        prices[iter] = res / samples;
    }
}

int main(int argc, char* argv[])
{
    bp_util_type bp = bp_util_create(argc, argv, 2);
    if (bp.args.has_error) {
        return 0;
    }
    const int samples    = bp.args.sizes[0];
    const int iterations = bp.args.sizes[1];

    double* market = model(samples);        // Generate pseudo-market data
    double* prices = (double*)malloc(sizeof(double)*iterations); // Prices

    bp.timer_start();
    pricing(
        market, prices,
        samples, iterations,
        'c', 65.0, 1.0 / 365.0,
        0.08, 0.3
    );
    bp.timer_stop();
    
    bp.print("black_scholes(cpp11_omp)");
    if (bp.args.verbose) {                 // and values.
        printf("output: [ ");
        for(int i=0; i<iterations; ++i) {
            printf("%f ", prices[i]);
        }
        printf("]\n");
    }

    free(market);
    free(prices);
    return 0;
}

Cpp11 Seq

#include <iostream>
#include <sstream>
#include <string>
#include <cmath>
#include <Random123/philox.h>
#include <bp_util.h>

using namespace std;

typedef union philox2x32_as_1x64 {
    philox2x32_ctr_t orig;
    uint64_t combined;
} philox2x32_as_1x64_t;

double* model(int64_t samples)
{
    const uint64_t key = 0;
    double* data = (double*)malloc(sizeof(double)*samples);

    for(int64_t count = 0; count<samples; ++count) {
        philox2x32_as_1x64_t counter;
        counter.combined = count;

        philox2x32_as_1x64_t x_philox;

        x_philox.orig = philox2x32(
          counter.orig,
          (philox2x32_key_t){ { key } } 
        );

        double x = x_philox.combined;
        x /= 18446744073709551616.000000;
        x *= 4.0;
        x += 58.0;          // Model between 58-62

        data[count] = x;
    }
    return data;
}

// The cumulative normal distribution function 
double cnd( double x )
{
    double L, K, w;

    double const a1 =  0.31938153,
                 a2 = -0.356563782,
                 a3 =  1.781477937,
                 a4 = -1.821255978,
                 a5 =  1.330274429;

    L = fabs(x);
    K = 1.0 / (1.0 + 0.2316419 * L);
    w = 1.0 - 1.0 / sqrt(2 * M_PI) * exp(-L *L / 2) * (\
        a1 * K         + \
        a2 * pow(K, 2) + \
        a3 * pow(K, 3) + \
        a4 * pow(K, 4) + \
        a5 * pow(K, 5)
    );

    return (x<0) ? 1.0 - w : w;
}

// The Black and Scholes (1973) Stock option formula
void pricing(double* market, double *prices,
             size_t samples, size_t iterations,
             char flag, double x, double d_t, double r, double v)
{
    double t = d_t;

    for(size_t iter=0; iter<iterations; ++iter) {
        double res = 0;
        for(size_t sample=0; sample<samples; ++sample) {
            double d1 = (log(market[sample]/x) + (r+v*v/2)*t) / (v*sqrt(t));
            double d2 = d1-v*sqrt(t);

            if (flag == 'c') {
                res += market[sample] *cnd(d1)-x * exp(-r*t)*cnd(d2);
            } else {
                res += x * exp(-r*t) * cnd(-d2) - market[sample] * cnd(-d1);
            }
        }
        t += d_t;
        prices[iter] = res / samples;
    }
}

int main(int argc, char* argv[])
{
    bp_util_type bp = bp_util_create(argc, argv, 2);
    if (bp.args.has_error) {
        return 0;
    }
    const int samples    = bp.args.sizes[0];
    const int iterations = bp.args.sizes[1];

    double* market = model(samples);        // Generate pseudo-market data
    double* prices = (double*)malloc(sizeof(double)*iterations); // Prices

    bp.timer_start();
    pricing(
        market, prices,
        samples, iterations,
        'c', 65.0, 1.0 / 365.0,
        0.08, 0.3
    );
    bp.timer_stop();
    
    bp.print("black_scholes(cpp11_seq)");
    if (bp.args.verbose) {                 // and values.
        printf("output: [ ");
        for(int i=0; i<iterations; ++i) {
            printf("%f ", prices[i]);
        }
        printf("]\n");
    }

    free(market);
    free(prices);
    return 0;
}

Csharp Numcil

#region Copyright
/*
This file is part of Bohrium and copyright (c) 2012 the Bohrium:
team <http://www.bh107.org>.

Bohrium is free software: you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as 
published by the Free Software Foundation, either version 3 
of the License, or (at your option) any later version.

Bohrium is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the 
GNU Lesser General Public License along with Bohrium. 

If not, see <http://www.gnu.org/licenses/>.
*/
#endregion

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

using R = NumCIL.Range;

namespace blackscholes
{
	using NumCIL.Double;
	using DATA = System.Double;
	
    public static class BlackScholesSolverDouble
    {
        private static NdArray CND(NdArray X)
        {
            DATA a1 = 0.31938153f, a2 = -0.356563782f, a3 = 1.781477937f, a4 = -1.821255978f, a5 = 1.330274429f;
            var L = X.Abs();
            var K = 1.0f / (1.0f + 0.2316419f * L);
            var w = 1.0f - 1.0f / ((DATA)Math.Sqrt(2 * Math.PI)) * (-L * L / 2.0f).Exp() * (a1 * K + a2 * (K.Pow(2)) + a3 * (K.Pow(3)) + a4 * (K.Pow(4)) + a5 * (K.Pow(5)));
            
            var mask1 = (NdArray)(X < 0);
            var mask2 = (NdArray)(X >= 0);

            w = w * mask2 + (1.0f - w) * mask1;
            return w;
        }

        private static NdArray BlackScholes(bool callputflag, NdArray S, DATA X, DATA T, DATA r, DATA v)
        {
            var d1 = ((S / X).Log() + (r + v * v / 2.0f) * T) / (v * (DATA)Math.Sqrt(T));
            var d2 = d1 - v * (DATA)Math.Sqrt(T);

            if (callputflag)
                return S * CND(d1) - X * (DATA)Math.Exp(-r * T) * CND(d2);
            else
                return X * (DATA)Math.Exp(-r * T) * CND(-d2) - S * CND(-d1);
        }
        
        public static NdArray Create(long size)
        {
            var S = Generate.Random(size);
            S = S * 4.0f - 2.0f + 60.0f; //Price is 58-62
            return S;
        }
        
        public static DATA Sync(NdArray data)
        {
        	return data.Value[0];
        }

        public static DATA Solve(NdArray data, long years)
        {
            DATA X = 65.0f;
            DATA r = 0.08f;
            DATA v = 0.3f;

            DATA day=1.0f/years;
            DATA T = day;

            DATA total = 0.0f;

            for (long i = 0; i < years; i++)
            {
                total += Add.Reduce(BlackScholes(true, data, X, T, r, v)).Value[0] / data.Shape.Length;
                T += day;
            }

            return total;
        }
    }
}

namespace blackscholes
{
	using NumCIL.Float;
	using DATA = System.Single;
	
    public static class BlackScholesSolverSingle
    {
        private static NdArray CND(NdArray X)
        {
            DATA a1 = 0.31938153f, a2 = -0.356563782f, a3 = 1.781477937f, a4 = -1.821255978f, a5 = 1.330274429f;
            var L = X.Abs();
            var K = 1.0f / (1.0f + 0.2316419f * L);
            var w = 1.0f - 1.0f / ((DATA)Math.Sqrt(2 * Math.PI)) * (-L * L / 2.0f).Exp() * (a1 * K + a2 * (K.Pow(2)) + a3 * (K.Pow(3)) + a4 * (K.Pow(4)) + a5 * (K.Pow(5)));
            
            var mask1 = (NdArray)(X < 0);
            var mask2 = (NdArray)(X >= 0);

            w = w * mask2 + (1.0f - w) * mask1;
            return w;
        }

        private static NdArray BlackScholes(bool callputflag, NdArray S, DATA X, DATA T, DATA r, DATA v)
        {
            var d1 = ((S / X).Log() + (r + v * v / 2.0f) * T) / (v * (DATA)Math.Sqrt(T));
            var d2 = d1 - v * (DATA)Math.Sqrt(T);

            if (callputflag)
                return S * CND(d1) - X * (DATA)Math.Exp(-r * T) * CND(d2);
            else
                return X * (DATA)Math.Exp(-r * T) * CND(-d2) - S * CND(-d1);
        }
        
        public static NdArray Create(long size)
        {
            var S = Generate.Random(size);
            S = S * 4.0f - 2.0f + 60.0f; //Price is 58-62
            return S;
        }
        
        public static DATA Sync(NdArray data)
        {
        	return data.Value[0];
        }

        public static DATA Solve(NdArray data, long years)
        {
            DATA X = 65.0f;
            DATA r = 0.08f;
            DATA v = 0.3f;

            DATA day=1.0f/years;
            DATA T = day;

            DATA total = 0.0f;

            for (long i = 0; i < years; i++)
            {
                total += Add.Reduce(BlackScholes(true, data, X, T, r, v)).Value[0] / data.Shape.Length;
                T += day;
            }

            return total;
        }
    }
}

Convolve

Python Numpy

from __future__ import print_function
from benchpress.benchmarks import util

# if Bohrium is installed we use its convolve else we use the convolve from SciPy
try:
    raise ImportError
    import bohrium as bh
    convolveNd = bh.convolve_scipy
except ImportError:
    import scipy
    from scipy import signal
    convolveNd = signal.convolve

bench = util.Benchmark("Convolution Filter (any dimensional)", "<image-size>*<filter-size>*<ndims>*<niters>")


def main():
    """
    Convolve filter (any dimensional)
    Parameter: `<image-size>*<filter-size>*<ndims>*<niters>`
                where image and filter size is the size of each dimension (not their total size).
    """
    (image_size, filter_size, ndims, I) = bench.args.size

    image = bench.random_array((image_size ** ndims,)).reshape([image_size] * ndims)
    image_filter = bench.random_array((filter_size ** ndims,)).reshape([filter_size] * ndims)

    bench.start()
    for _ in range(I):
        R = convolveNd(image, image_filter)
        bench.flush()
    bench.stop()
    bench.save_data({'res': R})
    bench.pprint()


if __name__ == "__main__":
    main()

Convolve1D

Python Numpy

"""
Convolve filter (1. dimensional)
Parameter: `--size<image-size>*<filter-size>*<niters>`.
"""

from __future__ import print_function
from benchpress.benchmarks import util
import numpy as np

bench = util.Benchmark("Convolution Filter 1D", "<image-size>*<filter-size>*<niters>")


def main():
    (image_size, filter_size, I) = bench.args.size

    image = bench.random_array((image_size,))
    image_filter = bench.random_array((filter_size,))

    bench.start()
    for _ in range(I):
        R = np.convolve(image, image_filter)
        bench.flush()
    bench.stop()
    bench.save_data({'res': R})
    bench.pprint()


if __name__ == "__main__":
    main()

Galton Bean Machine

Python Numpy

from __future__ import print_function
from benchpress.benchmarks import util
import numpy as np

bench = util.Benchmark("Galton Bean Machine", "<num_of_beans>*<height>")


def bean(num_beans, height):
    return np.sum(np.sign(bench.random_array((num_beans, height)) - 0.5), axis=1)


def main():
    num_beans, height = bench.args.size

    bench.start()
    R = bean(num_beans, height)
    bench.stop()
    bench.save_data({'res': R})
    bench.pprint()

    if bench.args.visualize:
        from matplotlib import pyplot
        bins = 100
        pyplot.hist(R, bins)
        pyplot.title("Galton Normal distribution")
        pyplot.xlabel("Value")
        pyplot.ylabel("Frequency")
        pyplot.show()


if __name__ == "__main__":
    main()

Gameoflife

Python Numpy

from __future__ import print_function

"""
Game of Life
------------

"""
from benchpress.benchmarks import util
import numpy as np

bench = util.Benchmark("Game of Life", "height*width*iterations*rules")

SURVIVE_LOW = 2
SURVIVE_HIGH = 3
SPAWN = 3


def world(height, width):
    state = np.ones((height + 2, width + 2), dtype=bench.dtype)
    state[2:-2, 2:-2] = bench.dtype(0)
    return state


def world_zeros(height, width):
    state = np.zeros((height + 2, width + 2), dtype=bench.dtype)
    return state


def play(state, iterations, version=1, visualize=False):
    cells = state[1:-1, 1:-1]
    ul = state[0:-2, 0:-2]
    um = state[0:-2, 1:-1]
    ur = state[0:-2, 2:]
    ml = state[1:-1, 0:-2]
    mr = state[1:-1, 2:]
    ll = state[2:, 0:-2]
    lm = state[2:, 1:-1]
    lr = state[2:, 2:]

    def update():
        """
        This is the first implementation of the game rules.
        """
        neighbors = ul + um + ur + ml + mr + ll + lm + lr  # count neighbors
        live = neighbors * cells  # extract live cells neighbors
        stay = (live >= SURVIVE_LOW) & (live <= SURVIVE_HIGH)  # find cells the stay alive
        dead = neighbors * (cells == 0)  # extract dead cell neighbors
        spawn = dead == SPAWN  # find cells that spaw new life

        cells[:] = stay | spawn  # save result for next iteration

    def update_optimized():
        """
        This is an optimized implementation of the game rules.
        """
        neighbors = ul + um + ur + ml + mr + ll + lm + lr  # Count neighbors

        c1 = (neighbors == SURVIVE_LOW)  # Life conditions
        c2 = (neighbors == SPAWN)

        cells[:] = cells * c1 + c2  # Update

    if version == 1:  # Select the update function
        update_func = update
    elif version == 2:
        update_func = update_optimized

    for i in range(iterations):  # Run the game
        if visualize:
            bench.plot_surface(state, "3d", 16, 1, 0)
        update_func()
        bench.flush()

    return state


def main():
    (H, W, I, V) = bench.args.size

    if V not in [1, 2]:
        raise Exception("Unsupported rule-implementation.")

    S = world(H, W)

    bench.start()
    R = play(S, I, V, bench.args.visualize)
    bench.stop()
    bench.save_data({'res': R})
    bench.pprint()


if __name__ == "__main__":
    main()

C99 Seq

Error

There are issues with the implementation.

In progress…

#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <bp_util.h>

void world(double* state, int height, int width)
{
    for (int h=0; h<height; ++h) {
        for (int w=0; w<width; ++w) {
            state[h*width + w] = (((h < 2) && (h > (height-3))) ||
                                 ((w < 2) && (w > (width-3))));
        }
    }
}

void play(double* state, int height, int width, int iterations, int version)
{
    printf("%p %d %d %d %d\n", state, height, width, iterations, version);
}

int main (int argc, char **argv)
{
    bp_util_type bp = bp_util_create(argc, argv, 4); // Grab arguments
    if (bp.args.has_error) {
        return 1;
    }

    const int height    = bp.args.sizes[0];
    const int width     = bp.args.sizes[1];
    const int iter      = bp.args.sizes[2];
    const int version   = bp.args.sizes[3];

    size_t grid_size = height*width*sizeof(double);
    double *state = (double*)malloc(grid_size);

    world(state, height, width);

    bp.timer_start();
    play(state, height, width, iter, version);
    bp.timer_stop();

    bp.print("gameoflife(c99_seq)");

    free(state);
    return 0;
}

Cpp11 Omp

Error

There are issues with the implementation.

In progress…

#include <iostream>
#include <iomanip>
#include <bp_util.h>

#define HEIGHT 1000
#define WIDTH  1000

static double SURVIVE_LOW  = 2.0;
static double SURVIVE_HIGH = 3.0;
static double SPAWN        = 3.0;

using namespace std;

template <typename T>
void world(T state)
{
    #pragma omp parallel for collapse(2)
    for(int h=0; h<HEIGHT+2; ++h) {
        for(int w=0; w<WIDTH+2; ++w) {
            state[h][w] = (((h<2) and (h>WIDTH-3)) &&
                           ((w<2) and (w>HEIGHT-3)));
        }
    }
}

template <typename T>
void play(T state, int iterations, int version, int visualize)
{
    cout << state << SURVIVE_LOW << SURVIVE_HIGH << SPAWN << endl;
    cout << state << iterations << version << visualize << endl;
    
    T* cells, ul, um, ur, ml, mr, ll, lm, lr;
}

template <typename T>
void bench(bp_util_type& bp, const int I, const int V)
{
    auto state = new T[HEIGHT][WIDTH];              // Construct matrices
    world(state);

    bp.timer_start();                               // Start timer

    play(state, I, V, bp.args.visualize);

    bp.timer_stop();                                // Stop timer

    bp.print("gameoflife(cpp11_omp)");				// Print results..
    if (bp.args.verbose) {                          // ..and value.
        cout << fixed << setprecision(10)
             << "Result = " << endl;
    }
}

int main(int argc, char* argv[])
{
    bp_util_type bp = bp_util_create(argc, argv, 4);// Grab arguments
    if (bp.args.has_error) {
        return 1;
    }
    const int W = bp.args.sizes[0];
    const int H = bp.args.sizes[1];
    const int I = bp.args.sizes[2];
    const int V = bp.args.sizes[3];

    if (HEIGHT != H) {
        cout << "Multidimensional arrays in C11 does not support dynamic size, so it has to be: " << H << "." << endl;
        return 0;
    }
    if (WIDTH != W) {
        cout << "Multidimensional arrays in C11 does not support dynamic size, so it has to be: " << W << "." << endl;
        return 0;
    }

    bench<double>(bp, I, V);

    return 0;
}

Gauss

Python Numpy

from __future__ import print_function
from benchpress.benchmarks import util
import numpy as np

bench = util.Benchmark("Gaussian elimination on matrix a without pivoting", "<size>")


def gauss(a):
    """
    Perform Gaussian elimination on matrix a without pivoting
    """
    for c in range(1, a.shape[0]):
        a[c:, c - 1:] = a[c:, c - 1:] - (a[c:, c - 1] / a[c - 1, c - 1:c])[:, None] * a[c - 1, c - 1:]
        bench.flush()
    a /= np.diagonal(a)[:, None]
    return a


def main():
    n = bench.args.size[0]

    matrix = bench.load_data()
    if matrix is not None:
        matrix = matrix['matrix']
    else:
        matrix = bench.random_array((n, n))

    bench.start()
    res = gauss(matrix)
    bench.stop()
    bench.save_data({'matrix': res})
    bench.pprint()


if __name__ == "__main__":
    main()

Heat Equation

Python Numpy

from __future__ import print_function
import numpy as np
from benchpress.benchmarks import util

bench = util.Benchmark("Solving the heat equation using the jacobi method", "height*width*iterations")


def init_grid(height, width, dtype=np.float32):
    grid = np.zeros((height + 2, width + 2), dtype=dtype)
    grid[:, 0] = dtype(-273.15)
    grid[:, -1] = dtype(-273.15)
    grid[-1, :] = dtype(-273.15)
    grid[0, :] = dtype(40.0)
    return grid


def jacobi(grid, epsilon=0.005, max_iterations=None):
    def loop_body(grid):
        center = grid[1:-1, 1:-1]
        north = grid[0:-2, 1:-1]
        east = grid[1:-1, 2:]
        west = grid[1:-1, 0:-2]
        south = grid[2:, 1:-1]
        work = 0.2 * (center + north + east + west + south)
        delta = np.sum(np.absolute(work - center))
        center[:] = work
        bench.plot_surface(grid, "2d", 0, 200, -200)
        return delta > epsilon

    bench.do_while(loop_body, max_iterations, grid)
    return grid


def main():
    H = bench.args.size[0]
    W = bench.args.size[1]
    I = bench.args.size[2]

    grid = bench.load_data()
    if grid is not None:
        grid = grid['grid']
    else:
        grid = init_grid(H, W, dtype=bench.dtype)

    bench.start()
    grid = jacobi(grid, max_iterations=I)
    bench.stop()
    bench.save_data({'grid': grid})
    bench.pprint()


if __name__ == "__main__":
    main()

C99 Omp

#include <math.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <bp_util.h>

/*
Without collapsing
void solve(int height, int width, double *grid, double epsilon, int max_iterations)
{
    double *T = (double*)malloc(height*width*sizeof(double));

    double delta = epsilon+1.0;
    int iterations = 0;

    while(delta>epsilon) {
        ++iterations;

        #pragma omp parallel for shared(grid, T) reduction(+:delta)
        for(int i=0; i<height-2; ++i)
        {
            int a = i * width;
            const double *up     = &grid[a+1];
            const double *left   = &grid[a+width];
            const double *right  = &grid[a+width+2];
            const double *down   = &grid[a+1+width*2];
            const double *center = &grid[a+width+1];
            double *t_center = &T[a+width+1];

            double delta_local = 0;
            for(int j=0; j<width-2; ++j)
            {
                *t_center = (*center + *up + *left + *right + *down) * 0.2;
                delta_local += fabs(*t_center - *center);
                center++;up++;left++;right++;down++;t_center++;
            }
            delta += delta_local;
        }

        #pragma omp parallel for shared(grid, T)
        for(int i=0; i<height-2; ++i)
        {
            int a = i * width;
            const double *center = &grid[a+width+1];
            double *t_center = &T[a+width+1];

            for(int j=0; j<width-2; ++j)
            {
                *t_center = *center;
                ++t_center;
                ++center;
            }
        }

        if (iterations>=max_iterations) {
            break;
        }
    }
    free(T);
}*/

void solve(int height, int width, double *grid, double epsilon, int max_iterations)
{
    double *T = (double*)malloc(height*width*sizeof(double));

    double delta = epsilon+1.0;
    int iterations = 0;

    while (delta>epsilon) {
        ++iterations;

        #pragma omp parallel for reduction(+:delta) collapse(2)
        for (int i=1; i<height-1; ++i) {
            for (int j=1; j<width-1; ++j) {
                const int a = i * width + j;
                const double *up     = &grid[a-width];
                const double *left   = &grid[a-1];
                const double *right  = &grid[a+1];
                const double *down   = &grid[a+width];
                const double *center = &grid[a];
                double *t_center = &T[a];

                *t_center = (*up + *down + *center + *left + *right) * 0.2;
                delta += fabs(*t_center - *center);
            }
        }

        #pragma omp parallel for collapse(2)
        for (int i=1; i<height-1; ++i) {
            for (int j=1; j<width-1; ++j) {
                const int a = i * width + j;
                const double *center = &grid[a];
                double *t_center = &T[a];

                *t_center = *center;
            }
        }

        if (iterations>=max_iterations) {
            break;
        }
    }
    free(T);
}

int main (int argc, char **argv)
{
    bp_util_type bp = bp_util_create(argc, argv, 3);
    if (bp.args.has_error) {
        return 1;
    }
    const int height    = bp.args.sizes[0];
    const int width     = bp.args.sizes[1];
    const int iter      = bp.args.sizes[2];
    double epsilon = 0.005;

    size_t grid_size = height*width*sizeof(double);
    double *grid = (double*)malloc(grid_size);
    //
    // NumaEffects - begin
    //
    // memset(grid, 0, grid_size);  // <--- bad idea.
    // memset is sequentiel and will touch the entire
    // grid on one numa-node.
    //
    // Instead of memset, parallel initialization
    // is performed with the following loop construct:
    //
    double* grid_i = grid;
    #pragma omp parallel for collapse(2)
    for (int i=0; i<height; ++i) {
        for (int j=0; j<width; ++j) {
            *grid_i = 0;
            ++grid_i;
        }
    }
    //
    // NumaEffects - end
    for (int j=0; j<height; j++) {
        grid[j*width]           = -273.15;
        grid[j*width+width-1]   = -273.15;
    }
    for (int j=0; j<width; j++) {
        grid[j+(height-1)*width] = -273.15;
        grid[j]                  = 40.0;
    }

    bp.timer_start();
    solve(height, width, grid, epsilon, iter);
    bp.timer_stop();

    bp.print("heat_equation(c99_omp)");
    free(grid);
    return 0;
}

C99 Omp Mpi

#include <math.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <mpi.h>
#include <bp_util.h>

inline double innerloop(const double* grid, double* T, int width, int i)
{
    int a = i * width;
    const double *up     = &grid[a+1];
    const double *left   = &grid[a+width];
    const double *right  = &grid[a+width+2];
    const double *down   = &grid[a+1+width*2];
    const double *center = &grid[a+width+1];
    double *t_center = &T[a+width+1];

    double delta = 0;
    for(int j=0; j<width-2; ++j)
    {
        *t_center = (*center + *up + *left + *right + *down) * 0.2;
        delta += fabs(*t_center - *center);
        center++;up++;left++;right++;down++;t_center++;
    }
    return delta;
}

void openmp(int height, int width, double *grid, int iter)
{
    double *T = (double*)malloc(height*width*sizeof(double));
    memcpy(T, grid, height*width*sizeof(double));
    for(int n=0; n<iter; ++n)
    {
        double delta=0;
        #pragma omp parallel for shared(grid,T) reduction(+:delta)
        for(int i=0; i<height-2; ++i)
        {
            delta += innerloop(grid, T, width, i);
        }
        memcpy(grid, T, height*width*sizeof(double));
    }
    free(T);
}

void openmp_mpi(int height, int width, double *grid, int iter)
{
    int myrank, worldsize;
    MPI_Comm comm;
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &worldsize);

    int periods[] = {0};
    MPI_Cart_create(MPI_COMM_WORLD, 1, &worldsize,
                    periods, 1, &comm);

    double *T = (double*)malloc(height*width*sizeof(double));
    memcpy(T, grid, height*width*sizeof(double));
    for(int n=0; n<iter; ++n)
    {
        int p_src, p_dest;
        //Send/receive - neighbor above
        MPI_Cart_shift(comm,0,1,&p_src,&p_dest);
        MPI_Sendrecv(grid+width,width,MPI_DOUBLE,
                     p_dest,1,
                     grid,width, MPI_DOUBLE,
                     p_src,1,comm,MPI_STATUS_IGNORE);
        //Send/receive - neighbor below
        MPI_Cart_shift(comm,0,-1,&p_src,&p_dest);
        MPI_Sendrecv(grid+(height-2)*width, width,MPI_DOUBLE,
                     p_dest,1,
                     grid+(height-1)*width,
                     width,MPI_DOUBLE,
                     p_src,1,comm,MPI_STATUS_IGNORE);

        double delta=0, global_delta;
        #pragma omp parallel for shared(grid,T) reduction(+:delta)
        for(int i=0; i<height-2; ++i)
        {
            delta += innerloop(grid, T, width, i);
        }
        memcpy(grid, T, height*width*sizeof(double));
        MPI_Allreduce(&global_delta, &delta, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }
    free(T);
}

int main (int argc, char **argv)
{
    MPI_Init(&argc,&argv);
    int myrank, worldsize;
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &worldsize);

    bp_util_type bp = bp_util_create(argc, argv, 3);
    if (bp.args.has_error) {
        return 1;
    }
    int height      = bp.args.sizes[0];
    const int width = bp.args.sizes[1];
    const int iter  = bp.args.sizes[2];

    //Local vertical size. NB: the horizontal size is always the full grid including borders
    height = (width-2) / worldsize;
    if(myrank == worldsize-1)
        height += (width-2) % worldsize;
    height += 2;//Add a ghost line above and below

    double *grid = malloc(height*width*sizeof(double));
    for(int j=0; j<width;j++)
    {
        if(myrank == 0)
           grid[j] = 40.0;
        if(myrank == worldsize-1)
            grid[j+(height-1)*width] = -273.15;
    }
    for(int j=1; j<height-1;j++)
    {
        grid[j*width] = -273.15;
        grid[j*width+width-1]= -273.15;
    }

    MPI_Barrier(MPI_COMM_WORLD);
    bp.timer_start();
    openmp(height,width,grid,iter);

    MPI_Barrier(MPI_COMM_WORLD);
    bp.timer_stop();

    if (myrank == 0) {
        bp.print("heat_equation(c99_omp_mpi)");
    }
    free(grid);
    MPI_Finalize();
    return 0;
}

C99 Seq

#include <math.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <bp_util.h>

void sequential(int height, int width, double *grid, int iter)
{
  double *T = (double*)malloc(height*width*sizeof(double));
  memcpy(T, grid, height*width*sizeof(double));
  for(int n=0; n<iter; ++n)
  {
    double *a = grid;
    double *t = T;
    double delta=0;
    for(int i=1; i<height-1; ++i)
    {
      double *up     = a+1;
      double *left   = a+width;
      double *right  = a+width+2;
      double *down   = a+1+width*2;
      double *center = a+width+1;
      double *t_center = t+width+1;
      for(int j=0; j<width-2; ++j)
      {
        *t_center = (*center + *up + *left + *right + *down) * 0.2;
        delta += fabs(*t_center - *center);
        center++;up++;left++;right++;down++;t_center++;
      }
      a += width;
      t += width;
    }
#ifdef DEBUG
    printf("Delta: %lf\n",delta);
#endif
    memcpy(grid, T, height*width*sizeof(double));
  }
  free(T);
}

int main (int argc, char **argv)
{
    bp_util_type bp = bp_util_create(argc, argv, 3);// Grab arguments
    if (bp.args.has_error) {
        return 1;
    }

    const int height    = bp.args.sizes[0];
    const int width     = bp.args.sizes[1];
    const int iter      = bp.args.sizes[2];

    size_t grid_size = height*width*sizeof(double);
    double *grid = (double*)malloc(grid_size);
    memset(grid, 0, grid_size);
    for(int j=0; j<width;j++)
    {
        grid[j] = 40.0;
        grid[j+(height-1)*width] = -273.15;
    }
    for(int j=1; j<height-1;j++)
    {
        grid[j*width] = -273.15;
        grid[j*width+width-1]= -273.15;
    }
#ifdef DEBUG
    for (int i = 0; i<height; i++)
    {
        for(int j=0; j<width;j++)
        {
            printf ("%lf ", grid[j+i*width]);
        }
        printf ("\n");
    }       
#endif
    bp.timer_start();
    sequential(height,width,grid,iter);
    bp.timer_stop();
#ifdef DEBUG
    for (int i = 0; i<height; i++)
    {
        for(int j=0; j<width;j++)
        {
            printf ("%lf ", grid[j+i*width]);
        }
        printf ("\n");
    }       
#endif
    bp.print("heat_equation(c99_seq)");

    free(grid);
    return 0;
}

Cpp11 Omp

#include <iostream>
#include <sstream>
#include <string>
#include <cmath>
#include <bp_util.h>

#define xdim 14000
#define ydim 14000

using namespace std;

template <typename T>
void solve(T grid, double epsilon, int max_iterations)
{
    T temp = new double[ydim][xdim];

    double delta = epsilon+1.0;
    int iterations = 0;            // Compute the heat equation

    while(delta>epsilon) {
        ++iterations;

        #pragma omp parallel for reduction(+:delta) collapse(2)
        for (int i=1; i<ydim-1; i++) {
            for(int j=1;j<xdim-1;j++) {
                temp[i][j] = (grid[i-1][j] + grid[i+1][j] + grid[i][j] + grid[i][j-1] + grid[i][j+1])*0.2;
                delta += abs(temp[i][j] - grid[i][j]);
            }
        }

        #pragma omp parallel for collapse(2)
        for (int i=1;i<ydim-1; i++) {
            for(int j=1;j<xdim-1;j++) {
                grid[i][j] = temp[i][j];
            }
        }

        if (iterations>=max_iterations) {
            break;
        }
    }
}

int main(int argc, char* argv[])
{
    bp_util_type bp = bp_util_create(argc, argv, 3);// Grab arguments
    if (bp.args.has_error) {
        return 1;
    }
    bp.timer_start();

    const int width     = bp.args.sizes[0];
    const int height    = bp.args.sizes[1];

    if (ydim != height) {
        cout << "Multidimensional arrays in C11 does not support dynamic size, so it has to be: " << ydim << "." << endl;
        return 0;
    }

    if (xdim != width) {
        cout << "Multidimensional arrays in C11 does not support dynamic size, so it has to be: " << xdim << "." << endl;
        return 0;
    }

    const int max_iterations = bp.args.sizes[2];

    double epsilon  = 0.005;

    auto grid = new double[ydim][xdim];
    #pragma omp parallel for collapse(2)
    for (int i=0; i<ydim; i++) {      // Initialize the grid
        for (int j=0;j<xdim;j++) {
            grid[i][j] = 0;
        }
    }
    for (int i=0; i<ydim; i++) {      // And borders
        grid[i][0]      = -273.15;
        grid[i][xdim-1] = -273.15;
    }
    for (int i=0; i<xdim; i++) {
        grid[0][i]      = -273.15;
        grid[ydim-1][i] = 40.0;
    }

    bp.timer_start();                                   // Start timer
    solve(grid, epsilon, max_iterations);
    bp.timer_stop();

    bp.print("heat_equation(cpp11_omp)");
    if (bp.args.verbose) {                             // and values.
        cout << ", \"output\": [";
        for (int i=0; i<10; ++i) {
            for (int j=0; j<10; ++j) {
                cout << grid[i][j] << ", ";
            }
            cout << endl;
        }
        cout << "]";
    }

    return 0;
}

Cpp11 Opencl

Error

There are issues with the implementation.

Two known issues:

* Implementation compiles (with warning) but execution is untested.

* Implementation does not use ``bp-util`` for argparsing and timing, getting it to run in a suite might be cumbersome...
#include <iostream>
#include <sys/time.h>
#include <CL/cl.hpp>
#include <stdexcept>
#include <sys/time.h>
#include <cstdio>
#include <fstream>
#include <sstream>
#include <utility>

cl::Context context;
std::vector<cl::Device> devices;
cl::CommandQueue commandQueue;
cl::Kernel kernel;

#define TPB 32
#ifndef DTYPE
#define DTYPE double
#endif
#define STR(s) #s
#define xSTR(s) STR(s)

int main (int argc, char** argv)
{
    if (argc != 5)
    {
        std::cout << "Usage: " << argv[0] << " HEAT-EQ-KERNEL-FILE WIDTH HEIGHT ITERATIONS" << std::endl;
        return 0;
    }
    // Commandline arguments
    char *opencl_file = argv[1];
    unsigned int width  = atoi(argv[2]);
    unsigned int height = atoi(argv[3]);
    unsigned int iter   = atoi(argv[4]);

    // Setup OpenCL execution system
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    bool foundPlatform = false;
    std::vector<cl::Platform>::iterator it;
    for (it = platforms.begin(); it != platforms.end(); ++it)
    {
        try {
            cl_context_properties props[] = {CL_CONTEXT_PLATFORM, (cl_context_properties)(*it)(),0};
            context = cl::Context(CL_DEVICE_TYPE_GPU, props);
            foundPlatform = true;
            break;
        }
        catch (cl::Error e)
        {
            foundPlatform = false;
        }
    }
    if (foundPlatform)
    {
        devices = context.getInfo<CL_CONTEXT_DEVICES>();
        commandQueue = cl::CommandQueue(context,devices[0],0);
    } else {
        throw std::runtime_error("Could not find valid OpenCL platform.");
    }

    // Read source file and compile code
    std::ifstream file(opencl_file, std::ios::in);
    if (!file.is_open())
    {
        throw std::runtime_error("Could not open source file.");
    }
    std::ostringstream srcstrs;
    srcstrs << "#define DTYPE " << xSTR(DTYPE) << "\n";
    srcstrs << file.rdbuf();
    std::string srcstr = srcstrs.str();
#ifdef DEBUG
    std::cout << srcstr << std::endl;
#endif
    cl::Program::Sources source(1,std::make_pair(srcstr.c_str(),0));
    cl::Program program(context, source);
    try {program.build(devices);}
    catch (cl::Error e)
    {
        std::cout << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(devices[0]) << std::endl;
    }
    kernel = cl::Kernel(program, "heat_eq_jacobi");

    // Set up model
    unsigned int w = width+2;
    unsigned int h = height+2;
    size_t grid_size = w*h*sizeof(DTYPE);
    size_t delta_size = w*sizeof(DTYPE);
    DTYPE* grid = (DTYPE*)malloc(grid_size);
    memset(grid, 0, grid_size);
    for(int j=0; j<w;j++)
    {
        grid[j] = 40.0;
        grid[j+(h-1)*w] = -273.15;
    }
    for(int j=1; j<h-1;j++)
    {
        grid[j*w] = -273.15;
        grid[j*w+w-1]= -273.15;
    }
#ifdef DEBUG
    for (int i = 0; i<h; i++)
    {
        for(int j=0; j<w;j++)
        {
            printf ("%lf ", grid[j+i*w]);
        }
        printf ("\n");
    }
#endif
    // Start timing
    timeval t_start,t_end;
    gettimeofday(&t_start,NULL);

    // Set up buffers and copy data
    cl::Buffer inDev = cl::Buffer(context, CL_MEM_READ_WRITE, grid_size, NULL);
    cl::Buffer outDev = cl::Buffer(context, CL_MEM_READ_WRITE, grid_size, NULL);
    cl::Buffer deltaDev = cl::Buffer(context, CL_MEM_READ_WRITE, delta_size, NULL);
    cl::Buffer deltaHost = cl::Buffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, delta_size, NULL);
    DTYPE* deltaHostPtr = (DTYPE*)commandQueue.enqueueMapBuffer(deltaHost, CL_TRUE, CL_MAP_WRITE, 0, delta_size);
    commandQueue.enqueueWriteBuffer(inDev, CL_FALSE, 0, grid_size, grid);
    commandQueue.enqueueCopyBuffer(inDev, outDev, 0, 0, grid_size);

    // Iterating
    for(int n=0; n<iter; ++n)
    {
        kernel.setArg(0,width);
        kernel.setArg(1,height);
        kernel.setArg(2,inDev);
        kernel.setArg(3,outDev);
        kernel.setArg(4,deltaDev);
        commandQueue.enqueueNDRangeKernel(kernel, cl::NullRange,
                                          cl::NDRange((width/TPB+1)*TPB), cl::NDRange(TPB));
        commandQueue.enqueueReadBuffer(deltaDev, CL_FALSE, 0, delta_size, deltaHostPtr);
        commandQueue.finish();
        DTYPE delta = 0.0;
        for (unsigned int i = 0; i < width; ++i)
            delta += deltaHostPtr[i];
#ifdef DEBUG
        std::cout << "Delta: " << delta << std::endl;
#endif
        cl::Buffer tmp = inDev;
        inDev = outDev;
        outDev = tmp;
    }
    commandQueue.enqueueReadBuffer(inDev, CL_TRUE, 0, grid_size, grid);
    gettimeofday(&t_end,NULL);
#ifdef DEBUG
    for (int i = 0; i<h; i++)
    {
        for(int j=0; j<w;j++)
        {
            printf ("%lf ", grid[j+i*w]);
        }
        printf ("\n");
    }
#endif
    double d_time = (t_end.tv_sec - t_start.tv_sec) + (t_end.tv_usec - t_start.tv_usec)/1000000.0;
    std::cout << argv[0] << " - iter: " << iter << " size: " << width << "*" << height << " elapsed-time: " <<
        d_time << std::endl;
    return 0;
}

Csharp Numcil

#region Copyright
/*
This file is part of Bohrium and copyright (c) 2012 the Bohrium:
team <http://www.bh107.org>.

Bohrium is free software: you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as 
published by the Free Software Foundation, either version 3 
of the License, or (at your option) any later version.

Bohrium is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the 
GNU Lesser General Public License along with Bohrium. 

If not, see <http://www.gnu.org/licenses/>.
*/
#endregion

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

using R = NumCIL.Range;

namespace HeatEquation
{
	using NumCIL.Double;
	using T = System.Double;

    public static class HeatEquationSolverDouble
    {
    	public static NdArray Create(long width, long height)
    	{
			var res = Generate.Zeroes(height + 2, width + 2);

			res[R.All, R.El(0)] = -273.5f;
			res[R.All, R.El(-1)] = -273.5f;
			res[R.El(0), R.All] = 40f;
			res[R.El(-1), R.All] = -273.5f;

			return res;
    	}
    
        public static Tuple<int, NdArray> Solve(NdArray full, long? fixedIterations = null)
        {			
            var center = full[R.Slice(1, -1),  R.Slice(1, -1) ];
            var north  = full[R.Slice(1, -1),  R.Slice(0, -2) ];
            var east   = full[R.Slice(0, -2),  R.Slice(1, -1) ];
            var west   = full[R.Slice(2,  0),  R.Slice(1, -1) ];
            var south  = full[R.Slice(1, -1),  R.Slice(2,  0) ];


            if (fixedIterations != null)
            {
            	for(var i =0; i < fixedIterations.Value; i++)
            		center[R.All] = 0.2f * (center + north + east + west + south);

				return new Tuple<int, NdArray>((int)fixedIterations.Value, full);
            }
            else
            {
				T epsilon = 0.005f;
	            T delta = epsilon + 1;

	            int iteration = 0;

	            while (epsilon < delta)
	            {
					iteration++;

					var work = 0.2f * (center + north + east + west + south);
					delta = (work - center).Abs().Sum();
					center[R.All] = work;
	            }

				return new Tuple<int, NdArray>(iteration, full);
            }

        }
    }
}

namespace HeatEquation
{
	using NumCIL.Float;
	using T = System.Single;

    public static class HeatEquationSolverSingle
    {
    	public static NdArray Create(long width, long height)
    	{
			var res = Generate.Zeroes(height + 2, width + 2);

			res[R.All, R.El(0)] = -273.5f;
			res[R.All, R.El(-1)] = -273.5f;
			res[R.El(0), R.All] = 40f;
			res[R.El(-1), R.All] = -273.5f;

			return res;
    	}
    
        public static Tuple<int, NdArray> Solve(NdArray full, long? fixedIterations = null)
        {			
            var center = full[R.Slice(1, -1),  R.Slice(1, -1) ];
            var north  = full[R.Slice(1, -1),  R.Slice(0, -2) ];
            var east   = full[R.Slice(0, -2),  R.Slice(1, -1) ];
            var west   = full[R.Slice(2,  0),  R.Slice(1, -1) ];
            var south  = full[R.Slice(1, -1),  R.Slice(2,  0) ];


            if (fixedIterations != null)
            {
            	for(var i =0; i < fixedIterations.Value; i++)
            		center[R.All] = 0.2f * (center + north + east + west + south);

				return new Tuple<int, NdArray>((int)fixedIterations.Value, full);
            }
            else
            {
				T epsilon = 0.005f;
	            T delta = epsilon + 1;

	            int iteration = 0;

	            while (epsilon < delta)
	            {
					iteration++;

					var work = 0.2f * (center + north + east + west + south);
					delta = (work - center).Abs().Sum();
					center[R.All] = work;
	            }

				return new Tuple<int, NdArray>(iteration, full);
            }

        }
    }
}

Lattice Boltzmann D2Q9

Python Numpy

from __future__ import print_function

"""
The Lattice Boltzmann Methods D2Q9
------------

Channel flow past a cylindrical obstacle, using a LB method
Copyright (C) 2006 Jonas Latt
Address: Rue General Dufour 24,  1211 Geneva 4, Switzerland
E-mail: Jonas.Latt@cui.unige.ch
"""
import numpy as np
from benchpress.benchmarks import util

bench = util.Benchmark("The Lattice Boltzmann Methods D2Q9", "height*width*iterations")

# D2Q9 Lattice constants
t = [4 / 9., 1 / 9., 1 / 9., 1 / 9., 1 / 9., 1 / 36., 1 / 36., 1 / 36., 1 / 36.]
cx = [0, 1, 0, -1, 0, 1, -1, -1, 1]
cy = [0, 0, 1, 0, -1, 1, 1, -1, -1]
opp = [0, 3, 4, 1, 2, 7, 8, 5, 6]
uMax = 0.02  # maximum velocity of Poiseuille inflow
Re = 100  # Reynolds number


def cylinder(height, width, obstacle=True):
    assert (height > 2)
    assert (width > 2)

    lx = width
    ly = height
    state = {"lx": lx, "ly": ly}
    obst_x = lx / 5. + 1  # position of the cylinder; (exact
    obst_y = ly / 2. + 0  # y-symmetry is avoided)
    obst_r = ly / 10. + 1  # radius of the cylinder

    nu = uMax * 2. * obst_r / Re  # kinematic viscosity
    omega = 1. / (3 * nu + 1. / 2.)  # relaxation parameter

    col = np.arange(1, ly - 1)

    # Place obstacle
    if obstacle:
        y, x = np.meshgrid(np.arange(ly), np.arange(lx))
        obst = (x - obst_x) ** 2 + (y - obst_y) ** 2 <= obst_r ** 2
        obst[:, 0] = obst[:, ly - 1] = 1
        bbRegion = np.asarray(obst, dtype=np.bool)
    else:
        bbRegion = None

    # We need some of the constants as 3D numpy arrays
    t_3d = np.asarray(t)[:, np.newaxis, np.newaxis]
    cx_3d = np.asarray(cx)[:, np.newaxis, np.newaxis]
    cy_3d = np.asarray(cy)[:, np.newaxis, np.newaxis]

    # Initial condition: (rho=0, u=0) ==> fIn[i] = t[i]
    fIn = np.array(t_3d.repeat(lx, axis=1).repeat(ly, axis=2))

    state['fIn'] = fIn
    state['cx_3d'] = cx_3d
    state['cy_3d'] = cy_3d
    state['col'] = col
    state['omega'] = float(omega)
    state['bbRegion'] = bbRegion

    # We save the visualize limits in order to make it constant
    # We find the limit by calculating the first iteration if `ux`
    ux = np.sum(cx_3d * fIn, axis=0) / np.sum(fIn, axis=0)
    L = ly - 2.0
    y = col - 0.5
    ux[0, 1:ly - 1] = 4 * uMax / (L ** 2) * (y * L - y ** 2)
    state['viz_max'] = ux.max()
    state['viz_min'] = ux.min()
    return state


def solve(state, timesteps):
    # load the ready only state
    ly = int(state['ly'])
    lx = int(state['lx'])
    col = state['col']
    cx_3d = state['cx_3d']
    cy_3d = state['cy_3d']
    bbRegion = state['bbRegion']
    omega = state['omega']

    def loop_body(fIn):

        # Macroscopic variables
        rho = np.sum(fIn, axis=0)
        ux = np.sum(cx_3d * fIn, axis=0) / rho
        uy = np.sum(cy_3d * fIn, axis=0) / rho

        # Macroscopic (Dirichlet) boundary conditions

        # Inlet: Poisseuille profile
        L = ly - 2.0
        y = col - 0.5
        ux[0, 1:ly - 1] = 4 * uMax / (L ** 2) * (y * L - y ** 2)
        uy[0, 1:ly - 1] = 0

        # Using a loop instead of "index of indexes"
        # t1 = fIn[[0, 2, 4], 0, 1:ly-1].sum(axis = 0)
        # t1 = np.zeros(ly-2, bohrium=fIn.bohrium)
        t1 = np.zeros(ly - 2)
        for i in [0, 2, 4]:
            t1 += fIn[i, 0, 1:ly - 1]

        # Using a loop instead of "index of indexes"
        # t2 = 2 * fIn[[3, 6, 7], 0, 1:ly-1].sum(axis = 0)
        # t2 = np.zeros(ly-2, bohrium=fIn.bohrium)
        t2 = np.zeros(ly - 2)
        for i in [3, 6, 7]:
            t2 += fIn[i, 0, 1:ly - 1]
        t2 *= 2

        rho[0, 1:ly - 1] = 1 / (1 - ux[0, 1:ly - 1]) * (t1 + t2)

        # Outlet: Zero gradient on rho/ux
        rho[lx - 1, 1:ly - 1] = 4.0 / 3 * rho[lx - 2, 1:ly - 1] - \
                                1.0 / 3 * rho[lx - 3, 1:ly - 1]
        uy[lx - 1, 1:ly - 1] = 0
        ux[lx - 1, 1:ly - 1] = 4.0 / 3 * ux[lx - 2, 1:ly - 1] - \
                               1.0 / 3 * ux[lx - 3, 1:ly - 1]

        # fEq = np.zeros((9, lx, ly), bohrium=fIn.bohrium)
        fEq = np.zeros((9, lx, ly))
        # fOut = np.zeros((9, lx, ly), bohrium=fIn.bohrium)
        fOut = np.zeros((9, lx, ly))
        for i in range(0, 9):
            cu = 3 * (cx[i] * ux + cy[i] * uy)
            fEq[i] = rho * t[i] * (1 + cu + 0.5 * cu ** 2 - \
                                   1.5 * (ux ** 2 + uy ** 2))
            fOut[i] = fIn[i] - omega * (fIn[i] - fEq[i])

        # Microscopic boundary conditions
        for i in range(0, 9):
            # Left boundary:
            fOut[i, 0, 1:ly - 1] = fEq[i, 0, 1:ly - 1] + 18 * t[i] * cx[i] * cy[i] * \
                                   (fIn[7, 0, 1:ly - 1] - fIn[6, 0, 1:ly - 1] - fEq[7, 0, 1:ly - 1] + fEq[6, 0,
                                                                                                      1:ly - 1])
            # Right boundary:
            fOut[i, lx - 1, 1:ly - 1] = fEq[i, lx - 1, 1:ly - 1] + 18 * t[i] * cx[i] * cy[i] * \
                                        (fIn[5, lx - 1, 1:ly - 1] - fIn[8, lx - 1, 1:ly - 1] - \
                                         fEq[5, lx - 1, 1:ly - 1] + fEq[8, lx - 1, 1:ly - 1])
            # Bounce back region:
            # fOut[i,bbRegion] = fIn[opp[i],bbRegion]
            # Using a explict mask
            if bbRegion is not None:
                masked = fIn[opp[i]].copy() * bbRegion
                fOut[i] = fOut[i] * ~bbRegion + masked

        # Streaming step
        for i in range(0, 9):
            # fIn[i] = np.roll(np.roll(fOut[i], cx[i], axis = 0), cy[i], axis = 1)
            # Replacing the np.roll() call with:
            if cx[i] == 1:
                # t1 = np.empty_like(fOut[i], bohrium=fIn.bohrium)
                t1 = np.empty_like(fOut[i])
                t1[1:] = fOut[i][:-1]
                t1[0] = fOut[i][-1]
                fOut[i] = t1
            elif cx[i] == -1:
                # t1 = np.empty_like(fOut[i], bohrium=fIn.bohrium)
                t1 = np.empty_like(fOut[i])
                t1[:-1] = fOut[i][1:]
                t1[-1] = fOut[i][0]
                fOut[i] = t1
            if cy[i] == 1:
                # t1 = np.empty_like(fOut[i], bohrium=fIn.bohrium)
                t1 = np.empty_like(fOut[i])
                t1[:, 1:] = fOut[i][:, :-1]
                t1[:, 0] = fOut[i][:, -1]
                fIn[i] = t1
            elif cy[i] == -1:
                # t1 = np.empty_like(fOut[i], bohrium=fIn.bohrium)
                t1 = np.empty_like(fOut[i])
                t1[:, :-1] = fOut[i][:, 1:]
                t1[:, -1] = fOut[i][:, 0]
                fIn[i] = t1
            else:
                fIn[i] = fOut[i]
        bench.plot_surface(ux.T, "2d", 0, state['viz_max'], state['viz_min'])

    bench.do_while(loop_body, timesteps, state['fIn'])


def main():
    H = bench.args.size[0]
    W = bench.args.size[1]
    I = bench.args.size[2]
    state = bench.load_data()
    if state is None:
        state = cylinder(H, W)
    bench.start()
    solve(state, I)
    bench.stop()
    bench.save_data(state)
    bench.pprint()


if __name__ == "__main__":
    main()

Leibnitz Pi

Python Numpy

from __future__ import print_function
from benchpress.benchmarks import util
import numpy as np

bench = util.Benchmark("Leibnitz Pi", "<size>*<niters>")


def leibnitz_pi(N):
    n = np.arange(0, N)
    return np.sum(1.0 / (4 * n + 1) - 1.0 / (4 * n + 3))


def main():
    (size, niter) = bench.args.size

    bench.start()
    pi = 0.0
    for _ in range(niter):
        pi += 4.0 * leibnitz_pi(size)  # Execute benchmark
        bench.flush()
    pi /= niter
    bench.stop()
    bench.pprint()
    if bench.args.verbose:
        print(pi)


if __name__ == "__main__":
    main()

C99 Seq

#include <stdio.h>
#include <bp_util.h>

double leibnitz_pi(int nelements)
{
    double sum = 0.0;
    for(int n=0; n<nelements; ++n) {
        sum += 1.0/(4*n+1) - 1.0/(4*n+3);
    }
    return sum;
}

int main(int argc, char* argv[])
{
    bp_util_type bp = bp_util_create(argc, argv, 1);// Grab arguments
    if (bp.args.has_error) {
        return 1;
    }
    const int nelements = bp.args.sizes[0];

    bp.timer_start();                               // Start timer
    double pi = 4.0*leibnitz_pi(nelements);         // Run benchmark
    bp.timer_stop();                                // Stop timer
    
    bp.print("leibnitz_pi(c99_seq)");               // Print results..
    if (bp.args.verbose) {							// ..and value.
        printf("PI-approximation = %.11f\n", pi);
    }

    return 0;
}

Cpp11 Omp

#include <iostream>
#include <iomanip>
#include <bp_util.h>
#include <omp.h>

using namespace std;

double leibnitz_pi(int nelements)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for(int n=0; n<nelements; ++n) {
        sum += 1.0/(4*n+1) - 1.0/(4*n+3);
    }
    return sum;
}

int main(int argc, char* argv[])
{
    bp_util_type bp = bp_util_create(argc, argv, 1);// Grab arguments
    if (bp.args.has_error) {
        return 1;
    }
    const int nelements = bp.args.sizes[0];

    bp.timer_start();                               // Start timer
    double pi = 4.0*leibnitz_pi(nelements);         // Run benchmark
    bp.timer_stop();                                // Stop timer

    bp.print("leibnitz_pi(cpp11_omp)");				// Print results..
    if (bp.args.verbose) {                          // ..and value.
        cout << fixed << setprecision(11)
			 << "PI-approximation = " << pi << endl;
    }

    return 0;
}

Cpp11 Seq

#include <iostream>
#include <iomanip>
#include <bp_util.h>

using namespace std;

double leibnitz_pi(int nelements)
{
    double sum = 0.0;
    for(int n=0; n<nelements; ++n) {
        sum += 1.0/(4*n+1) - 1.0/(4*n+3);
    }
    return sum;
}

int main(int argc, char* argv[])
{
    bp_util_type bp = bp_util_create(argc, argv, 1);// Grab arguments
    if (bp.args.has_error) {
        return 1;
    }
    const int nelements = bp.args.sizes[0];

    bp.timer_start();                               // Start timer
    double pi = 4.0*leibnitz_pi(nelements);         // Run benchmark
    bp.timer_stop();                                // Stop timer

    bp.print("leibnitz_pi(cpp11_seq)");				// Print results..
    if (bp.args.verbose) {                          // ..and value.
        cout << fixed << setprecision(11)
			 << "PI-approximation = " << pi << endl;
    }

    return 0;
}

Lu

Python Numpy

from __future__ import print_function
from benchpress.benchmarks import util
import numpy as np

bench = util.Benchmark("LU decomposition on the matrix so that A = L*U", "<size>")


def lu(a):
    """
    Perform LU decomposition on the matrix `a` so that A = L*U
    """
    u = a.copy()
    l = np.identity(a.shape[0], a.dtype)
    for c in range(1, u.shape[0]):
        l[c:, c - 1] = (u[c:, c - 1] / u[c - 1, c - 1:c])
        u[c:, c - 1:] = u[c:, c - 1:] - l[c:, c - 1][:, None] * u[c - 1, c - 1:]
        bench.flush()
    return (l, u)


def main():
    n = bench.args.size[0]
    matrix = bench.random_array((n, n))
    bench.start()
    res = lu(matrix)
    bench.stop()
    bench.pprint()


if __name__ == "__main__":
    main()

Magnetic Field Extrapolation

Python Cython

import magnetic_field_extrapolation
magnetic_field_extrapolation.main()

Python Numba

# This Python file uses the following encoding: utf-8
from benchpress.benchmarks import util
import numpy as np
from math import cos, sin, exp, pi
import numba
from numba import cuda

def window(B,a=0.37):
    assert (len(B.shape) == 2)
    assert (B.shape[0] == B.shape[1])
    n = B.shape[0]
    wl = np.ones_like(B[0])
    b = int(np.ceil((a * (n-1) / 2)))
    wl[:b]  =  0.5 * (1 + np.cos(pi*(2 * np.arange(b) / (a * (n-1)) - 1)))
    wl[-b:] =  0.5 * (1 + np.cos(pi*(2 * np.arange(b-1,-1,-1) / (a * (n-1)) - 1)))
    wl *= wl
    w = np.sqrt(wl+wl[:,None])
    return B*w

def calcB_loop(B_x0, alpha=0.0,
          x_min = 0.0, x_max = 0.25,
          y_min = 0.0, y_max = 1.0,
          z_min = 0.0, z_max = 1.0):

    n = len(B_x0)

    x = np.linspace(x_min,x_max,num=n,endpoint=False).astype(B_x0.dtype,copy=False)
    y = np.linspace(y_min,y_max,num=n).astype(B_x0.dtype,copy=False)
    z = np.linspace(z_min,z_max,num=n).astype(B_x0.dtype,copy=False)
    u = np.arange(n,dtype=B_x0.dtype)

    # Making C
    C = 4.0 / (n-1.0)**2 * np.sum(np.sum((B_x0 * np.sin(pi/y_max * u * y[:,None])[:,:,None])[:,None] * np.sin(pi/z_max * u * z[:,None])[:,None],-1),-1)
    l = pi**2 * ((u**2 / y_max)[:,None] + (u**2 / z_max))
    l[0,0] = 1.0
    r = np.sqrt(l - alpha**2)
    exprx = np.exp((-r * x[:, None, None]))

    # Calculating B
    Bx = np.empty((n,n,n),dtype=B_x0.dtype)
    By = np.empty((n,n,n),dtype=B_x0.dtype)
    Bz = np.empty((n,n,n),dtype=B_x0.dtype)
    temp_x = np.empty((n, n), dtype=B_x0.dtype)
    temp_y = np.empty((n, n), dtype=B_x0.dtype)
    temp_z = np.empty((n, n), dtype=B_x0.dtype)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                Bx[k, i, j] = 0
                By[k, i, j] = 0
                Bz[k, i, j] = 0
                for m in range(n):
                    sincos = np.sin(pi * k * y[i] / y_max) * (m * np.cos(pi * m * z[j] / z_max))
                    cossin = (k * np.cos(pi * k * y[i] / y_max)) * (np.sin(pi * m * z[j] / z_max))
                    temp_x[k,m] = C[k,m] * (np.sin(pi * k * y[i] / y_max) * (np.sin(pi * m * z[j] / z_max)))
                    temp_y[k,m] = C[k,m] / l[k,m] * (alpha * pi / z_max * sincos - r[k,m] * pi / y_max * cossin)
                    temp_z[k,m] = C[k,m] / l[k,m] * (alpha * pi / y_max * cossin + r[k,m] * pi / z_max * sincos)
            for k in range(n):
                for m in range(n):
                    for q in range(n):
                        Bx[k, i, j] += temp_x[m, q] * exprx[k, m, q]
                        By[k, i, j] += temp_y[m, q] * exprx[k, m, q]
                        Bz[k, i, j] += temp_z[m, q] * exprx[k, m, q]
    return (Bx, By, Bz)


#@cuda.jit()
@numba.jit(nopython=True)
def loop(n, Bx, By, Bz, u, y, y_max, z, z_max, alpha, C, l, r, temp_x, temp_y, temp_z, exprx):
    for i in range(n):
        for j in range(n):
            for k in range(n):
                Bx[k, i, j] = 0
                By[k, i, j] = 0
                Bz[k, i, j] = 0
                for m in range(n):
                    sincos = sin(pi * k * y[i] / y_max) * (m * cos(pi * m * z[j] / z_max))
                    cossin = (k * cos(pi * k * y[i] / y_max)) * (sin(pi * m * z[j] / z_max))
                    temp_x[k,m] = C[k,m] * (sin(pi * k * y[i] / y_max) * (sin(pi * m * z[j] / z_max)))
                    temp_y[k,m] = C[k,m] / l[k,m] * (alpha * pi / z_max * sincos - r[k,m] * pi / y_max * cossin)
                    temp_z[k,m] = C[k,m] / l[k,m] * (alpha * pi / y_max * cossin + r[k,m] * pi / z_max * sincos)
            for k in range(n):
                for m in range(n):
                    for q in range(n):
                        Bx[k, i, j] += temp_x[m, q] * exprx[k, m, q]
                        By[k, i, j] += temp_y[m, q] * exprx[k, m, q]
                        Bz[k, i, j] += temp_z[m, q] * exprx[k, m, q]


def calcB_numba(B_x0, alpha=0.0,
          x_min = 0.0, x_max = 0.25,
          y_min = 0.0, y_max = 1.0,
          z_min = 0.0, z_max = 1.0):

    n = len(B_x0)

    x = np.linspace(x_min,x_max,num=n,endpoint=False).astype(B_x0.dtype,copy=False)
    y = np.linspace(y_min,y_max,num=n).astype(B_x0.dtype,copy=False)
    z = np.linspace(z_min,z_max,num=n).astype(B_x0.dtype,copy=False)
    u = np.arange(n,dtype=B_x0.dtype)

    # Making C
    C = 4.0 / (n-1.0)**2 * np.sum(np.sum((B_x0 * np.sin(pi/y_max * u * y[:,None])[:,:,None])[:,None] * np.sin(pi/z_max * u * z[:,None])[:,None],-1),-1)
    l = pi**2 * ((u**2 / y_max)[:,None] + (u**2 / z_max))
    l[0,0] = 1.0
    r = np.sqrt(l - alpha**2)
    exprx = np.exp((-r * x[:, None, None]))

    # Calculating B
    Bx = np.empty((n,n,n),dtype=B_x0.dtype)
    By = np.empty((n,n,n),dtype=B_x0.dtype)
    Bz = np.empty((n,n,n),dtype=B_x0.dtype)
    temp_x = np.empty((n, n), dtype=B_x0.dtype)
    temp_y = np.empty((n, n), dtype=B_x0.dtype)
    temp_z = np.empty((n, n), dtype=B_x0.dtype)
    loop(n, Bx, By, Bz, u, y, y_max, z, z_max, alpha, C, l, r, temp_x, temp_y, temp_z, exprx)
    return (Bx, By, Bz)


def main():
    B = util.Benchmark()
    if B.inputfn is None:
        B_x0 = B.random_array((B.size[0],B.size[1]), dtype=B.dtype)
    else:
        inputfn = B.inputfn if B.inputfn else '../idl_input-float64_512*512.npz'
        sd = { 512:1, 256:2, 128:4, 64:8, 32:16, 16:32, 8:64}
        try:
            h = sd[B.size[0]]
            w = sd[B.size[1]]
        except KeyError:
            raise ValueError('Only valid sizes are: '+str(sd.keys()))
        B_x0 = B.load_array(inputfn, 'input', dtype=B.dtype)[::h,::w]

    B.start()
    for _ in range(B.size[2]):
        Rx, Ry, Rz = calcB_numba(window(B_x0.copy()))
        """
        Rx1, Ry1, Rz1 = calcB_loop(window(B_x0.copy()))
        assert np.allclose(Rx, Rx1)
        assert np.allclose(Ry, Ry1)
        assert np.allclose(Rz, Rz1)
        """
    B.stop()
    B.pprint()

    if B.outputfn:
        R = Rx+Ry+Rz
        B.tofile(B.outputfn, {'res': R, 'res_x': Rx, 'res_y': Ry, 'res_z': Rz})

if __name__ == '__main__':
    main()

Python Numpy

# This Python file uses the following encoding: utf-8
from __future__ import print_function
from benchpress.benchmarks import util
import numpy as np
import math

bench = util.Benchmark("Magnetic Field Extrapolation", "<x-size>*<y-size>*<iterations>")


def window(B, a=0.37):
    assert (len(B.shape) == 2)
    assert (B.shape[0] == B.shape[1])
    n = B.shape[0]
    wl = np.ones_like(B[0])
    b = int(np.ceil((a * (n - 1) / 2)))
    wl[:b] = 0.5 * (1 + np.cos(math.pi * (2 * np.arange(b) / (a * (n - 1)) - 1)))
    wl[-b:] = 0.5 * (1 + np.cos(math.pi * (2 * np.arange(b - 1, -1, -1) / (a * (n - 1)) - 1)))
    wl *= wl
    w = np.sqrt(wl + wl[:, None])
    return B * w


def calcB(B_x0, alpha=0.0,
          x_min=0.0, x_max=0.25,
          y_min=0.0, y_max=1.0,
          z_min=0.0, z_max=1.0):
    n = len(B_x0)
    x = np.linspace(x_min, x_max, num=n, endpoint=False).astype(B_x0.dtype, copy=False)
    y = np.linspace(y_min, y_max, num=n).astype(B_x0.dtype, copy=False)
    z = np.linspace(z_min, z_max, num=n).astype(B_x0.dtype, copy=False)
    u = np.arange(n, dtype=B_x0.dtype)

    # Making C
    C = 4.0 / (n - 1.0) ** 2 * np.sum(np.sum(
        (B_x0 * np.sin(math.pi / y_max * u * y[:, None])[:, :, None])[:, None] * np.sin(
            math.pi / z_max * u * z[:, None])[:, None], -1), -1)
    l = np.pi ** 2 * ((u ** 2 / y_max)[:, None] + (u ** 2 / z_max))
    l[0, 0] = 1.0
    r = np.sqrt(l - alpha ** 2)

    # Calculating B
    sincos = np.sin(math.pi / y_max * u * y[:, None])[:, None, :, None] * (u * np.cos(
        math.pi / z_max * u * z[:, None]))[None, :, None, :]
    cossin = (u * np.cos(math.pi / y_max * u * y[:, None]))[:, None, :, None] * np.sin(
        math.pi / z_max * u * z[:, None])[None, :, None, :]
    temp_x = C * np.sin(math.pi / y_max * u * y[:, None])[:, None, :, None] * np.sin(math.pi / z_max * u * z[:, None])[
                                                                              None, :, None, :]
    temp_y = C / l * (alpha * math.pi / z_max * sincos - r * math.pi / y_max * cossin)
    temp_z = C / l * (alpha * math.pi / y_max * cossin + r * math.pi / z_max * sincos)
    exprx = np.exp((-r * x[:, None, None]))

    Bx = np.sum(np.sum(temp_x * exprx[:, None, None], -1), -1)
    By = np.sum(np.sum(temp_y * exprx[:, None, None], -1), -1)
    Bz = np.sum(np.sum(temp_z * exprx[:, None, None], -1), -1)
    return (Bx, By, Bz)


def main():
    B_x0 = bench.load_data()
    if B_x0 is None:
        B_x0 = bench.random_array((bench.args.size[0], bench.args.size[1]))
    else:
        sd = {512: 1, 256: 2, 128: 4, 64: 8, 32: 16, 16: 32}
        try:
            h = sd[bench.args.size[0]]
            w = sd[bench.args.size[1]]
        except KeyError:
            raise ValueError('Only valid sizes are: ' + str(sd.keys()))
        B_x0 = B_x0['input'][::h, ::w]

    bench.start()
    for _ in range(bench.args.size[2]):
        Rx, Ry, Rz = calcB(window(B_x0))
        bench.flush()
    bench.stop()
    bench.pprint()


if __name__ == '__main__':
    main()

Montecarlo Pi

Python Numpy

from __future__ import print_function
from benchpress.benchmarks import util
import numpy as np

bench = util.Benchmark("Monte Carlo Pi", "samples*iterations")


def montecarlo_pi(samples):
    x = bench.random_array((samples,))
    y = bench.random_array((samples,))
    m = np.sqrt(x * x + y * y) <= 1.0
    return np.sum(m) * 4.0 / samples


def solve(samples, iterations):
    acc = 0.0
    for _ in range(iterations):
        acc += montecarlo_pi(samples)
        bench.flush()
    acc /= iterations
    return acc


def main():
    samples, iterations = bench.args.size
    bench.start()
    res = solve(samples, iterations)
    bench.stop()
    bench.pprint()
    if bench.args.verbose:
        print(res)


if __name__ == "__main__":
    main()

C99 Omp

#include <inttypes.h>
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <Random123/philox.h>
#include <bp_util.h>

typedef union philox2x32_as_1x64 {
    philox2x32_ctr_t orig;
    uint64_t combined;
} philox2x32_as_1x64_t;

double montecarlo_pi(uint32_t samples, uint32_t x_index, uint32_t y_index, uint32_t key)
{
    uint64_t darts = 0;
    
    #pragma omp parallel for reduction(+:darts)
    for (uint32_t eidx=0; eidx<samples; ++eidx) {

        uint64_t x_raw = ((philox2x32_as_1x64_t)philox2x32(
          ((philox2x32_as_1x64_t)( (uint64_t)(x_index + eidx) )).orig,
          (philox2x32_key_t){ { key } }
        )).combined;
        double x = x_raw / 18446744073709551616.000000;

        uint64_t y_raw = ((philox2x32_as_1x64_t)philox2x32(
          ((philox2x32_as_1x64_t)( (uint64_t)(y_index + eidx) )).orig,
          (philox2x32_key_t){ { key } }
        )).combined;
        double y = y_raw / 18446744073709551616.000000;

        darts += sqrt(x*x + y*y) <= 1;
    }
    return (double)darts * (double)4.0 / (double)samples;
}

double run_montecarlo_pi(int64_t samples, int64_t iterations)
{
    uint64_t index = 0;                 // Philox index
    const uint64_t key = 0;             // Philox key
    uint64_t x_index = 0;
    uint64_t y_index = 0;

    double pi_accu = 0.0;               // Accumulation over PI-approximations.
    for(int64_t i=0; i<iterations; ++i) {
        x_index = index;
        index += samples;
        y_index = index;
        index += samples;
        pi_accu += montecarlo_pi(samples, x_index, y_index, key);
    }
    pi_accu /= iterations;              // Approximated value of PI

    return pi_accu;
}

int main(int argc, char** argv)
{
    bp_util_type bp = bp_util_create(argc, argv, 2);// Grab arguments
    if (bp.args.has_error) {
        return 1;
    }
    const int samples = bp.args.sizes[0];
    const int iterations = bp.args.sizes[1];

    bp.timer_start();
    double pi = run_montecarlo_pi(samples, iterations);
    bp.timer_stop();

    bp.print("montecarlo_pi(c99_omp)");
    if (bp.args.verbose) {
        printf("PI-approximation = %f\n", pi);
    }

    return 0;
}

C99 Seq

#include <inttypes.h>
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <Random123/philox.h>
#include <bp_util.h>

typedef union philox2x32_as_1x64 {
    philox2x32_ctr_t orig;
    uint64_t combined;
} philox2x32_as_1x64_t;

double montecarlo_pi(uint32_t samples, uint32_t x_index, uint32_t y_index, uint32_t key)
{
    uint64_t darts = 0;
    
    for (uint32_t eidx=0; eidx<samples; ++eidx) {

        uint64_t x_raw = ((philox2x32_as_1x64_t)philox2x32(
          ((philox2x32_as_1x64_t)( (uint64_t)(x_index + eidx) )).orig,
          (philox2x32_key_t){ { key } }
        )).combined;
        double x = x_raw / 18446744073709551616.000000;

        uint64_t y_raw = ((philox2x32_as_1x64_t)philox2x32(
          ((philox2x32_as_1x64_t)( (uint64_t)(y_index + eidx) )).orig,
          (philox2x32_key_t){ { key } }
        )).combined;
        double y = y_raw / 18446744073709551616.000000;

        darts += sqrt(x*x + y*y) <= 1;
    }
    return (double)darts * (double)4.0 / (double)samples;
}

double run_montecarlo_pi(int64_t samples, int64_t iterations)
{
    uint64_t index = 0;                 // Philox index
    const uint64_t key = 0;             // Philox key
    uint64_t x_index = 0;
    uint64_t y_index = 0;

    double pi_accu = 0.0;               // Accumulation over PI-approximations.
    for(int64_t i=0; i<iterations; ++i) {
        x_index = index;
        index += samples;
        y_index = index;
        index += samples;
        pi_accu += montecarlo_pi(samples, x_index, y_index, key);
    }
    pi_accu /= iterations;              // Approximated value of PI

    return pi_accu;
}

int main(int argc, char** argv)
{
    bp_util_type bp = bp_util_create(argc, argv, 2);// Grab arguments
    if (bp.args.has_error) {
        return 1;
    }
    const int samples = bp.args.sizes[0];
    const int iterations = bp.args.sizes[1];

    bp.timer_start();
    double pi = run_montecarlo_pi(samples, iterations);
    bp.timer_stop();

    bp.print("montecarlo_pi(c99_seq)");
    if (bp.args.verbose) {
        printf("PI-approximation = %f\n", pi);
    }

    return 0;
}

Cpp11 Omp

#include <inttypes.h>
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <Random123/philox.h>
#include <bp_util.h>

typedef union philox2x32_as_1x64 {
    philox2x32_ctr_t orig;
    uint64_t combined;
} philox2x32_as_1x64_t;

double montecarlo_pi(uint32_t samples, uint32_t x_index, uint32_t y_index, uint32_t key)
{
    uint64_t darts = 0;
    
    #pragma omp parallel for reduction(+:darts)
    for (uint32_t eidx=0; eidx<samples; ++eidx) {

        philox2x32_as_1x64_t x_counter;
        x_counter.combined = x_index + eidx;

        philox2x32_as_1x64_t x_philox;
        x_philox.orig = philox2x32(
          x_counter.orig,
          (philox2x32_key_t){ { key } }
        );
        double x = x_philox.combined / 18446744073709551616.000000;

        philox2x32_as_1x64_t y_counter;
        y_counter.combined = y_index + eidx;
        
        philox2x32_as_1x64_t y_philox;
        y_philox.orig = philox2x32(
          y_counter.orig,
          (philox2x32_key_t){ { key } }
        );
        double y = y_philox.combined / 18446744073709551616.000000;

        darts += sqrt(x*x + y*y) <= 1;
    }
    return (double)darts * (double)4.0 / (double)samples;
}

double run_montecarlo_pi(int64_t samples, int64_t iterations)
{
    uint64_t index = 0;                 // Philox index
    const uint64_t key = 0;             // Philox key
    uint64_t x_index = 0;
    uint64_t y_index = 0;

    double pi_accu = 0.0;               // Accumulation over PI-approximations.
    for(int64_t i=0; i<iterations; ++i) {
        x_index = index;
        index += samples;
        y_index = index;
        index += samples;
        pi_accu += montecarlo_pi(samples, x_index, y_index, key);
    }
    pi_accu /= iterations;              // Approximated value of PI

    return pi_accu;
}

int main(int argc, char** argv)
{
    bp_util_type bp = bp_util_create(argc, argv, 2);// Grab arguments
    if (bp.args.has_error) {
        return 1;
    }
    const int samples = bp.args.sizes[0];
    const int iterations = bp.args.sizes[1];

    bp.timer_start();
    double pi = run_montecarlo_pi(samples, iterations);
    bp.timer_stop();

    bp.print("montecarlo_pi(cpp11_omp)");
    if (bp.args.verbose) {
        printf("PI-approximation = %f\n", pi);
    }

    return 0;
}

Cpp11 Seq

#include <inttypes.h>
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <Random123/philox.h>
#include <bp_util.h>

typedef union philox2x32_as_1x64 {
    philox2x32_ctr_t orig;
    uint64_t combined;
} philox2x32_as_1x64_t;

double montecarlo_pi(uint32_t samples, uint32_t x_index, uint32_t y_index, uint32_t key)
{
    uint64_t darts = 0;
    
    for (uint32_t eidx=0; eidx<samples; ++eidx) {

        philox2x32_as_1x64_t x_counter;
        x_counter.combined = x_index + eidx;

        philox2x32_as_1x64_t x_philox;
        x_philox.orig = philox2x32(
          x_counter.orig,
          (philox2x32_key_t){ { key } }
        );
        double x = x_philox.combined / 18446744073709551616.000000;

        philox2x32_as_1x64_t y_counter;
        y_counter.combined = y_index + eidx;
        
        philox2x32_as_1x64_t y_philox;
        y_philox.orig = philox2x32(
          y_counter.orig,
          (philox2x32_key_t){ { key } }
        );
        double y = y_philox.combined / 18446744073709551616.000000;

        darts += sqrt(x*x + y*y) <= 1;
    }
    return (double)darts * (double)4.0 / (double)samples;
}

double run_montecarlo_pi(int64_t samples, int64_t iterations)
{
    uint64_t index = 0;                 // Philox index
    const uint64_t key = 0;             // Philox key
    uint64_t x_index = 0;
    uint64_t y_index = 0;

    double pi_accu = 0.0;               // Accumulation over PI-approximations.
    for(int64_t i=0; i<iterations; ++i) {
        x_index = index;
        index += samples;
        y_index = index;
        index += samples;
        pi_accu += montecarlo_pi(samples, x_index, y_index, key);
    }
    pi_accu /= iterations;              // Approximated value of PI

    return pi_accu;
}

int main(int argc, char** argv)
{
    bp_util_type bp = bp_util_create(argc, argv, 2);// Grab arguments
    if (bp.args.has_error) {
        return 1;
    }
    const int samples = bp.args.sizes[0];
    const int iterations = bp.args.sizes[1];

    bp.timer_start();
    double pi = run_montecarlo_pi(samples, iterations);
    bp.timer_stop();

    bp.print("montecarlo_pi(cpp11_seq)");
    if (bp.args.verbose) {
        printf("PI-approximation = %f\n", pi);
    }

    return 0;
}

Nbody

Python Numpy

from __future__ import print_function

"""
NBody in N^2 complexity

Note that we are using only Newtonian forces and do not consider relativity
Neither do we consider collisions between stars
Thus some of our stars will accelerate to speeds beyond c
This is done to keep the simulation simple enough for teaching purposes

All the work is done in the calc_force, move and random_galaxy functions.
To vectorize the code these are the functions to transform.
"""
from benchpress.benchmarks import util
import numpy as np

bench = util.Benchmark("Gaussian elimination on matrix a without pivoting", "<size>")

G = 6.67384e-11  # m/(kg*s^2)
dt = 60 * 60 * 24 * 365.25  # Years in seconds
r_ly = 9.4607e15  # Lightyear in m
m_sol = 1.9891e30  # Solar mass in kg


def diagonal(ary, offset=0):
    if ary.ndim != 2:
        raise Exception("diagonal only supports 2 dimensions\n")
    if offset < 0:
        offset = -offset
        if (ary.shape[0] - offset) > ary.shape[1]:
            ary_diag = ary[offset, :]
        else:
            ary_diag = ary[offset:, 0]
    else:
        if ary.shape[1] - offset > ary.shape[0]:
            ary_diag = ary[:, offset]
        else:
            ary_diag = ary[0, offset:]
    ary_diag.strides = (ary.strides[0] + ary.strides[1],)
    return ary_diag


def random_galaxy(N):
    """Generate a galaxy of random bodies"""

    galaxy = {  # We let all bodies stand still initially
        'm': (bench.random_array((N,)) + 10) * m_sol / 10,
        'x': (bench.random_array((N,)) - 0.5) * r_ly / 100,
        'y': (bench.random_array((N,)) - 0.5) * r_ly / 100,
        'z': (bench.random_array((N,)) - 0.5) * r_ly / 100,
        'vx': np.zeros(N, dtype=bench.dtype),
        'vy': np.zeros(N, dtype=bench.dtype),
        'vz': np.zeros(N, dtype=bench.dtype)
    }
    if bench.dtype == np.float32:
        galaxy['m'] /= 1e10
        galaxy['x'] /= 1e5
        galaxy['y'] /= 1e5
        galaxy['z'] /= 1e5
    return galaxy


def move(galaxy, dt):
    """Move the bodies
    first find forces and change velocity and then move positions
    """
    n = len(galaxy['x'])
    # Calculate all dictances component wise (with sign)
    dx = galaxy['x'][np.newaxis, :].T - galaxy['x']
    dy = galaxy['y'][np.newaxis, :].T - galaxy['y']
    dz = galaxy['z'][np.newaxis, :].T - galaxy['z']

    # Euclidian distances (all bodys)
    r = np.sqrt(dx ** 2 + dy ** 2 + dz ** 2)
    diagonal(r)[:] = 1.0

    # prevent collision
    mask = r < 1.0
    r = r * ~mask + 1.0 * mask

    m = galaxy['m'][np.newaxis, :].T

    # Calculate the acceleration component wise
    Fx = G * m * dx / r ** 3
    Fy = G * m * dy / r ** 3
    Fz = G * m * dz / r ** 3
    # Set the force (acceleration) a body exerts on it self to zero
    diagonal(Fx)[:] = 0.0
    diagonal(Fy)[:] = 0.0
    diagonal(Fz)[:] = 0.0

    galaxy['vx'] += dt * np.sum(Fx, axis=0)
    galaxy['vy'] += dt * np.sum(Fy, axis=0)
    galaxy['vz'] += dt * np.sum(Fz, axis=0)

    galaxy['x'] += dt * galaxy['vx']
    galaxy['y'] += dt * galaxy['vy']
    galaxy['z'] += dt * galaxy['vz']


def main():
    nbodies, niters = bench.args.size
    galaxy = random_galaxy(nbodies)

    bench.start()
    for i in range(niters):
        move(galaxy, dt)
        bench.flush()
    bench.stop()
    bench.pprint()


if __name__ == "__main__":
    main()

Csharp Numcil

#region Copyright
/*
This file is part of Bohrium and copyright (c) 2012 the Bohrium:
team <http://www.bh107.org>.

Bohrium is free software: you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as
published by the Free Software Foundation, either version 3
of the License, or (at your option) any later version.

Bohrium is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the
GNU Lesser General Public License along with Bohrium.

If not, see <http://www.gnu.org/licenses/>.
*/
#endregion

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

using R = NumCIL.Range;

namespace nbody
{
	using NumCIL.Double;
	using DATA = System.Double;

	public static class nBodySolverDouble
	{
		//Gravity
		private static DATA G = (DATA)1.0;

		private static void FillDiagonal(NdArray a, DATA val)
		{
			long d  = a.Shape.Dimensions[0].Length;
			a.Reshape(new NumCIL.Shape(new long[] { d }, 0, new long[] { d+1 })).Set(val);
		}

		private static void CalcForce(Galaxy b)
		{
			var dx = b.x - b.x[R.NewAxis, R.All].Transposed;
			FillDiagonal(dx, 1);
			var dy = b.y - b.y[R.NewAxis, R.All].Transposed;
			FillDiagonal(dy, 1);
			var dz = b.z - b.z[R.NewAxis, R.All].Transposed;
			FillDiagonal(dz, 1);
			var pm = b.mass - b.mass[R.NewAxis, R.All].Transposed;
			FillDiagonal(pm, 0);

			var r = (dx.Pow(2) + dy.Pow(2) + dz.Pow(2)).Pow((DATA)0.5);

			//In the below calc of the the forces the force of a body upon itself
			//becomes nan and thus destroys the data
			var Fx = G * pm / r.Pow(2) * (dx / r);
			var Fy = G * pm / r.Pow(2) * (dy / r);
			var Fz = G * pm / r.Pow(2) * (dz / r);

			//The diagonal nan numbers must be removed so that the force from a body
			//upon itself is zero
			FillDiagonal(Fx,0);
			FillDiagonal(Fy,0);
			FillDiagonal(Fz,0);

			b.vx += Add.Reduce(Fx, 1) / b.mass;
			b.vy += Add.Reduce(Fy, 1) / b.mass;
			b.vz += Add.Reduce(Fz, 1) / b.mass;
		}

		private static void Move(Galaxy galaxy)
		{
			CalcForce(galaxy);

			galaxy.x += galaxy.vx;
			galaxy.y += galaxy.vy;
			galaxy.z += galaxy.vz;
		}

		public class Galaxy
		{
			private const DATA XMAX = 500;
			private const DATA YMAX = 500;
			private const DATA ZMAX = 500;

			public NdArray mass;
			public NdArray x;
			public NdArray y;
			public NdArray z;
			public NdArray vx;
			public NdArray vy;
			public NdArray vz;

			public Galaxy(long size)
			{
				this.mass = Generate.Random(size) * (DATA)Math.Pow(10, 6) / (DATA)(4 * Math.PI * Math.PI);
				this.x = Generate.Random(size) * 2 * XMAX - XMAX;
				this.y = Generate.Random(size) * 2 * YMAX - YMAX;
				this.z = Generate.Random(size) * 2 * ZMAX - ZMAX;
				this.vx = Generate.Zeroes(size);
				this.vy = Generate.Zeroes(size);
				this.vz = Generate.Zeroes(size);
			}

			public DATA Sync()
			{
				return this.z.Value[0] + this.y.Value[0] + this.x.Value[0];
			}
		}

		public static void Solve(Galaxy galaxy, long steps)
		{
			for(long i = 0; i < steps; i++)
				Move(galaxy);
		}
	}
}

namespace nbody
{
	using NumCIL.Double;
	using DATA = System.Double;

	public static class nBodySolverDoubleNoTempArrays
	{
		//Gravity
		private const DATA G = (DATA)1.0;
		private const DATA MIN_DIST = (DATA)1.0;

		private static void FillDiagonal(NdArray a, DATA val)
		{
			long d  = a.Shape.Dimensions[0].Length;
			a.Reshape(new NumCIL.Shape(new long[] { d }, 0, new long[] { d+1 })).Set(val);
		}

		private static void CalcForce(Galaxy b)
		{
			// Calculate all dictances component wise (with sign)
			b.x.Sub(b.x[R.NewAxis, R.All].Transposed, b.dx);
			b.y.Sub(b.y[R.NewAxis, R.All].Transposed, b.dy);
			b.z.Sub(b.z[R.NewAxis, R.All].Transposed, b.dz);

			// Euclidian distances (all bodys)
			b.dx.Pow(2, b.r0);
			b.dy.Pow(2, b.r1);
			b.r0.Add(b.r1, b.r0);
			b.dz.Pow(2, b.r1);
			b.r0.Add(b.r1, b.r0);
			b.r0.Pow((DATA)0.5, b.r0);
			FillDiagonal(b.r0, (Double)1.0);

			// prevent collition
			NumCIL.Double.GreaterThanOrEqual.Apply(b.r0, MIN_DIST, b.mask);
			b.mask.ToDouble (b.r1);
			b.r1.Mul (b.r0, b.r0);

			NumCIL.Boolean.Not.Apply (b.mask, b.mask);
			b.mask.ToDouble (b.r1);
			b.r1.Mul (MIN_DIST, b.r1);
			b.r1.Add (b.r0, b.r0);

			//Calculate the acceleration component wise
			var m = b.mass[R.NewAxis, R.All].Transposed;
			b.dx.Mul(G,b.dx);
			b.dy.Mul(G,b.dy);
			b.dz.Mul(G,b.dz);
			b.dx.Mul(m,b.dx);
			b.dy.Mul(m,b.dy);
			b.dz.Mul(m,b.dz);
			b.r0.Pow(3, b.r0);
			b.dx.Div(b.r0, b.dx);
			b.dy.Div(b.r0, b.dy);
			b.dz.Div(b.r0, b.dz);

			// Set the force (acceleration) a body exerts on it self to zero
			FillDiagonal(b.dx, 0);
			FillDiagonal(b.dy, 0);
			FillDiagonal(b.dz, 0);

			Double dt = (Double) (60*60*24*365.25); // Years in seconds
			b.dx.Reduce<Add>(0, b.t0);
			b.t0.Mul(dt, b.t0);
			b.vx.Add(b.t0, b.vx);
			b.dy.Reduce<Add>(0, b.t0);
			b.t0.Mul(dt, b.t0);
			b.vy.Add(b.t0, b.vy);
			b.dz.Reduce<Add>(0, b.t0);
			b.t0.Mul(dt, b.t0);
			b.vz.Add(b.t0, b.vz);
		}

		private static void Move(Galaxy galaxy)
		{
			CalcForce(galaxy);

			galaxy.x.Add(galaxy.vx, galaxy.x);
			galaxy.y.Add(galaxy.vy, galaxy.y);
			galaxy.z.Add(galaxy.vz, galaxy.z);
		}

		public class Galaxy
		{
			private const DATA XMAX = 500;
			private const DATA YMAX = 500;
			private const DATA ZMAX = 500;

			public NdArray mass;
			public NdArray x;
			public NdArray y;
			public NdArray z;
			public NdArray vx;
			public NdArray vy;
			public NdArray vz;
			public NdArray dx;
			public NdArray dy;
			public NdArray dz;
			public NdArray pmass;
			public NdArray r0;
			public NdArray r1;
			public NdArray t0;
			public NumCIL.Boolean.NdArray mask;

			public Galaxy(long size)
			{
				this.mass = Generate.Random(size) * (DATA)Math.Pow(10, 6) / (DATA)(4 * Math.PI * Math.PI);
				this.x = Generate.Random(size) * 2 * XMAX - XMAX;
				this.y = Generate.Random(size) * 2 * YMAX - YMAX;
				this.z = Generate.Random(size) * 2 * ZMAX - ZMAX;
				this.vx = Generate.Zeroes(size);
				this.vy = Generate.Zeroes(size);
				this.vz = Generate.Zeroes(size);
				this.dx = new NdArray(new [] {size, size});
				this.dy = new NdArray(new [] {size, size});
				this.dz = new NdArray(new [] {size, size});
				this.pmass = new NdArray(new [] {size, size});
				this.r0 = new NdArray(this.dx.Shape);
				this.r1 = new NdArray(this.dx.Shape);
				this.t0 = new NdArray(this.x.Shape);
				this.mask = new NumCIL.Boolean.NdArray(this.dx.Shape);
			}

			public DATA Sync()
			{
				return this.z.Value[0] + this.y.Value[0] + this.x.Value[0];
			}
		}

		public static void Solve(Galaxy galaxy, long steps)
		{
			for(long i = 0; i < steps; i++)
				Move(galaxy);
		}
	}
}

namespace nbody
{
	using NumCIL.Float;
	using DATA = System.Single;

	public static class nBodySolverFloat
	{
		//Gravity
		private static DATA G = (DATA)1.0;

		private static void FillDiagonal(NdArray a, DATA val)
		{
			long d  = a.Shape.Dimensions[0].Length;
			a.Reshape(new NumCIL.Shape(new long[] { d }, 0, new long[] { d+1 })).Set(val);
		}

		private static void CalcForce(Galaxy b)
		{
			var dx = b.x - b.x[R.NewAxis, R.All].Transposed;
			FillDiagonal(dx, 1);
			var dy = b.y - b.y[R.NewAxis, R.All].Transposed;
			FillDiagonal(dy, 1);
			var dz = b.z - b.z[R.NewAxis, R.All].Transposed;
			FillDiagonal(dz, 1);
			var pm = b.mass - b.mass[R.NewAxis, R.All].Transposed;
			FillDiagonal(pm, 0);

			var r = (dx.Pow(2) + dy.Pow(2) + dz.Pow(2)).Pow((DATA)0.5);

			//In the below calc of the the forces the force of a body upon itself
			//becomes nan and thus destroys the data
			var Fx = G * pm / r.Pow(2) * (dx / r);
			var Fy = G * pm / r.Pow(2) * (dy / r);
			var Fz = G * pm / r.Pow(2) * (dz / r);

			//The diagonal nan numbers must be removed so that the force from a body
			//upon itself is zero
			FillDiagonal(Fx,0);
			FillDiagonal(Fy,0);
			FillDiagonal(Fz,0);

			b.vx += Add.Reduce(Fx, 1) / b.mass;
			b.vy += Add.Reduce(Fy, 1) / b.mass;
			b.vz += Add.Reduce(Fz, 1) / b.mass;
		}

		private static void Move(Galaxy galaxy)
		{
			CalcForce(galaxy);

			galaxy.x += galaxy.vx;
			galaxy.y += galaxy.vy;
			galaxy.z += galaxy.vz;
		}

		public class Galaxy
		{
			private const DATA XMAX = 500;
			private const DATA YMAX = 500;
			private const DATA ZMAX = 500;

			public NdArray mass;
			public NdArray x;
			public NdArray y;
			public NdArray z;
			public NdArray vx;
			public NdArray vy;
			public NdArray vz;

			public Galaxy(long size)
			{
				this.mass = Generate.Random(size) * (DATA)Math.Pow(10, 6) / (DATA)(4 * Math.PI * Math.PI);
				this.x = Generate.Random(size) * 2 * XMAX - XMAX;
				this.y = Generate.Random(size) * 2 * YMAX - YMAX;
				this.z = Generate.Random(size) * 2 * ZMAX - ZMAX;
				this.vx = Generate.Zeroes(size);
				this.vy = Generate.Zeroes(size);
				this.vz = Generate.Zeroes(size);
			}

			public DATA Sync()
			{
				return this.z.Value[0] + this.y.Value[0] + this.x.Value[0];
			}
		}

		public static void Solve(Galaxy galaxy, long steps)
		{
			for(long i = 0; i < steps; i++)
				Move(galaxy);
		}
	}
}

namespace nbody
{
	using NumCIL.Float;
	using DATA = System.Single;

	public static class nBodySolverFloatNoTempArrays
	{
		//Gravity
		private const DATA G = (DATA)1.0;
		private const DATA MIN_DIST = (DATA)1.0;

		private static void FillDiagonal(NdArray a, DATA val)
		{
			long d  = a.Shape.Dimensions[0].Length;
			a.Reshape(new NumCIL.Shape(new long[] { d }, 0, new long[] { d+1 })).Set(val);
		}

		private static void CalcForce(Galaxy b)
		{
			// Calculate all dictances component wise (with sign)
			b.x.Sub(b.x[R.NewAxis, R.All].Transposed, b.dx);
			b.y.Sub(b.y[R.NewAxis, R.All].Transposed, b.dy);
			b.z.Sub(b.z[R.NewAxis, R.All].Transposed, b.dz);

			// Euclidian distances (all bodys)
			b.dx.Pow(2, b.r0);
			b.dy.Pow(2, b.r1);
			b.r0.Add(b.r1, b.r0);
			b.dz.Pow(2, b.r1);
			b.r0.Add(b.r1, b.r0);
			b.r0.Pow((DATA)0.5, b.r0);
			FillDiagonal(b.r0, (Single)1.0);

			// prevent collition
			NumCIL.Float.GreaterThanOrEqual.Apply(b.r0, MIN_DIST, b.mask);
			b.mask.ToFloat (b.r1);
			b.r1.Mul (b.r0, b.r0);

			NumCIL.Boolean.Not.Apply (b.mask, b.mask);
			b.mask.ToFloat (b.r1);
			b.r1.Mul (MIN_DIST, b.r1);
			b.r1.Add (b.r0, b.r0);

			//Calculate the acceleration component wise
			var m = b.mass[R.NewAxis, R.All].Transposed;
			b.dx.Mul(G,b.dx);
			b.dy.Mul(G,b.dy);
			b.dz.Mul(G,b.dz);
			b.dx.Mul(m,b.dx);
			b.dy.Mul(m,b.dy);
			b.dz.Mul(m,b.dz);
			b.r0.Pow(3, b.r0);
			b.dx.Div(b.r0, b.dx);
			b.dy.Div(b.r0, b.dy);
			b.dz.Div(b.r0, b.dz);

			// Set the force (acceleration) a body exerts on it self to zero
			FillDiagonal(b.dx, 0);
			FillDiagonal(b.dy, 0);
			FillDiagonal(b.dz, 0);

			Single dt = (Single) (60*60*24*365.25); // Years in seconds
			b.dx.Reduce<Add>(0, b.t0);
			b.t0.Mul(dt, b.t0);
			b.vx.Add(b.t0, b.vx);
			b.dy.Reduce<Add>(0, b.t0);
			b.t0.Mul(dt, b.t0);
			b.vy.Add(b.t0, b.vy);
			b.dz.Reduce<Add>(0, b.t0);
			b.t0.Mul(dt, b.t0);
			b.vz.Add(b.t0, b.vz);
		}

		private static void Move(Galaxy galaxy)
		{
			CalcForce(galaxy);

			galaxy.x.Add(galaxy.vx, galaxy.x);
			galaxy.y.Add(galaxy.vy, galaxy.y);
			galaxy.z.Add(galaxy.vz, galaxy.z);
		}

		public class Galaxy
		{
			private const DATA XMAX = 500;
			private const DATA YMAX = 500;
			private const DATA ZMAX = 500;

			public NdArray mass;
			public NdArray x;
			public NdArray y;
			public NdArray z;
			public NdArray vx;
			public NdArray vy;
			public NdArray vz;
			public NdArray dx;
			public NdArray dy;
			public NdArray dz;
			public NdArray pmass;
			public NdArray r0;
			public NdArray r1;
			public NdArray t0;
			public NumCIL.Boolean.NdArray mask;

			public Galaxy(long size)
			{
				this.mass = Generate.Random(size) * (DATA)Math.Pow(10, 6) / (DATA)(4 * Math.PI * Math.PI);
				this.x = Generate.Random(size) * 2 * XMAX - XMAX;
				this.y = Generate.Random(size) * 2 * YMAX - YMAX;
				this.z = Generate.Random(size) * 2 * ZMAX - ZMAX;
				this.vx = Generate.Zeroes(size);
				this.vy = Generate.Zeroes(size);
				this.vz = Generate.Zeroes(size);
				this.dx = new NdArray(new [] {size, size});
				this.dy = new NdArray(new [] {size, size});
				this.dz = new NdArray(new [] {size, size});
				this.pmass = new NdArray(new [] {size, size});
				this.r0 = new NdArray(this.dx.Shape);
				this.r1 = new NdArray(this.dx.Shape);
				this.t0 = new NdArray(this.x.Shape);
				this.mask = new NumCIL.Boolean.NdArray(this.dx.Shape);
			}

			public DATA Sync()
			{
				return this.z.Value[0] + this.y.Value[0] + this.x.Value[0];
			}
		}

		public static void Solve(Galaxy galaxy, long steps)
		{
			for(long i = 0; i < steps; i++)
				Move(galaxy);
		}
	}
}

Nbody Nice

Python Numpy

Error

There are issues with the implementation.

Visualization does not work when running with Bohrium since it relies on NumPy masked array.

#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
NBody in N^2 complexity

Note that we are using only Newtonian forces and do not consider relativity
Neither do we consider collisions between stars
Thus some of our stars will accelerate to speeds beyond c
This is done to keep the simulation simple enough for teaching purposes

"""
from __future__ import print_function
from benchpress.benchmarks import util
import numpy as np

try:
    import numpy_force as npf
except ImportError:
    import numpy as npf

from nbody_nice_visualization import gfx_init, gfx_show

bench = util.Benchmark("n-body using the NICE method", "<image-size>*<filter-size>*<ndims>*<niters>")


def fill_diagonal(a, val):
    d, _ = a.shape  # This only makes sense for square matrices
    a.shape = d * d  # Flatten a without making a copy
    a[::d + 1] = val  # Assign the diagonal values
    a.shape = (d, d)  # Return a to its original shape


def calc_force(a, b, dt):
    """
    Calculate forces between bodies
    F = ((G m_a m_b)/r^2)/((x_b-x_a)/r)
    """
    # Ignore division by zero since we fix it explicitely by setting the diagonal in the forces arrays
    npf.seterr(divide='ignore', invalid='ignore')

    G = 6.673e-11

    dx = b['x'] - a['x'][:, None]
    dy = b['y'] - a['y'][:, None]
    dz = b['z'] - a['z'][:, None]
    pm = b['m'] * a['m'][:, None]

    #
    # For some reason then this pow(T, 0.5) is deadly to performance...
    # sqrt(T) is equivalent math, trying it out instead.
    #
    # This might actually be a neat optimization:
    # pow(T, 0.K) => k-root(T)
    #
    # r = ( dx ** 2 + dy ** 2 + dz ** 2) ** 0.5
    r = np.sqrt(dx ** 2 + dy ** 2 + dz ** 2)

    Fx = G * pm / r ** 2 * (dx / r)
    Fy = G * pm / r ** 2 * (dy / r)
    Fz = G * pm / r ** 2 * (dz / r)

    # The diagonal nan numbers must be removed so that the force from a body
    # upon itself is zero
    if a is b:
        fill_diagonal(Fx, 0.)
        fill_diagonal(Fy, 0.)
        fill_diagonal(Fz, 0.)

    a['vx'] += np.add.reduce(Fx, axis=1) / a['m'] * dt
    a['vy'] += np.add.reduce(Fy, axis=1) / a['m'] * dt
    a['vz'] += np.add.reduce(Fz, axis=1) / a['m'] * dt


def move(solarsystem, asteroids, dt):
    """
    Move the bodies
    first find forces and change velocity and then move positions
    """
    calc_force(solarsystem, solarsystem, dt)
    calc_force(asteroids, solarsystem, dt)
    solarsystem['x'] += solarsystem['vx'] * dt
    solarsystem['y'] += solarsystem['vy'] * dt
    solarsystem['z'] += solarsystem['vz'] * dt

    asteroids['x'] += asteroids['vx'] * dt
    asteroids['y'] += asteroids['vy'] * dt
    asteroids['z'] += asteroids['vz'] * dt


def random_system(x_max, y_max, z_max, n, b, dtype=npf.float):
    """Generate a galaxy of random bodies"""

    solarmass = 1.98892e30

    def circlev(rx, ry, rz):
        """Helper function..."""
        r2 = npf.sqrt(rx * rx + ry * ry + rz * rz)
        numerator = (6.67e-11) * 1e6 * solarmass
        return npf.sqrt(numerator / r2)

    solarsystem = {}

    solarsystem['x'] = npf.random.random(n)
    solarsystem['y'] = npf.random.random(n)
    solarsystem['z'] = npf.random.random(n) * .01
    dist = (1.0 / npf.sqrt(solarsystem['x'] ** 2 + solarsystem['y'] ** 2 + solarsystem['z'] ** 2)) - (
            0.8 - npf.random.random() * .1)

    solarsystem['x'] = x_max * solarsystem['x'] * dist * npf.sign(.5 - npf.random.random(n))
    solarsystem['y'] = y_max * solarsystem['y'] * dist * npf.sign(.5 - npf.random.random(n))
    solarsystem['z'] = z_max * solarsystem['z'] * dist * npf.sign(.5 - npf.random.random(n))
    magv = circlev(
        solarsystem['x'],
        solarsystem['y'],
        solarsystem['z']
    )

    absangle = npf.arctan(npf.absolute(solarsystem['y'] / solarsystem['x']))
    thetav = npf.pi / 2 - absangle
    solarsystem['vx'] = -1 * npf.sign(solarsystem['y']) * npf.cos(thetav) * magv
    solarsystem['vy'] = npf.sign(solarsystem['x']) * npf.sin(thetav) * magv
    solarsystem['vz'] = npf.zeros(n)
    solarsystem['m'] = npf.random.random(n) * solarmass * 10 + 1e20;

    solarsystem['m'][0] = 1e6 * solarmass
    solarsystem['x'][0] = 0
    solarsystem['y'][0] = 0
    solarsystem['z'][0] = 0
    solarsystem['vx'][0] = 0
    solarsystem['vy'][0] = 0
    solarsystem['vz'][0] = 0

    asteroids = {}
    asteroids['x'] = npf.random.random(b)
    asteroids['y'] = npf.random.random(b)
    asteroids['z'] = npf.random.random(b) * .01
    dist = (1.0 / npf.sqrt(asteroids['x'] ** 2 + asteroids['y'] ** 2 + asteroids['z'] ** 2)) - (
            npf.random.random() * .2)
    asteroids['x'] = x_max * asteroids['x'] * dist * npf.sign(.5 - npf.random.random(b))
    asteroids['y'] = y_max * asteroids['y'] * dist * npf.sign(.5 - npf.random.random(b))
    asteroids['z'] = z_max * asteroids['z'] * dist * npf.sign(.5 - npf.random.random(b))
    magv = circlev(
        asteroids['x'],
        asteroids['y'],
        asteroids['z']
    )

    absangle = npf.arctan(npf.absolute(asteroids['y'] / asteroids['x']))
    thetav = npf.pi / 2 - absangle
    asteroids['vx'] = -1 * npf.sign(asteroids['y']) * npf.cos(thetav) * magv
    asteroids['vy'] = npf.sign(asteroids['x']) * npf.sin(thetav) * magv
    asteroids['vz'] = npf.zeros(b)
    asteroids['m'] = npf.random.random(b) * solarmass * 10 + 1e14;

    ss = {}
    for key in solarsystem:
        ss[key] = np.array(solarsystem[key].astype(dtype))
    a = {}
    for key in asteroids:
        a[key] = np.array(asteroids[key].astype(dtype))

    return ss, a


def main():
    nplanets, nbodies, timesteps = bench.args.size

    x_max = 1e18  # Simulation constants
    y_max = 1e18
    z_max = 1e18
    dt = 1e12

    solarsystem, asteroids = random_system(  # Setup galaxy
        x_max,
        y_max,
        z_max,
        nplanets,
        nbodies,
        bench.dtype
    )

    if bench.args.visualize:  # Init visuals
        plt, P3 = gfx_init(x_max, y_max, z_max)

    bench.start()  # Timer start
    for timestep in range(0, timesteps):  # Run simulation
        if bench.args.visualize and timestep % 10 == 0:  # With or without..
            gfx_show(plt, P3, solarsystem, asteroids)  # ..visuals
        move(solarsystem, asteroids, dt)
        bench.flush()
    bench.stop()  # Timer stop
    bench.pprint()  # Print results..
    if bench.args.visualize:  # Keep showing visuals
        plt.show(block=True)


if __name__ == "__main__":
    main()

Csharp Numcil

#region Copyright
/*
This file is part of Bohrium and copyright (c) 2012 the Bohrium:
team <http://www.bh107.org>.

Bohrium is free software: you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as 
published by the Free Software Foundation, either version 3 
of the License, or (at your option) any later version.

Bohrium is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the 
GNU Lesser General Public License along with Bohrium. 

If not, see <http://www.gnu.org/licenses/>.
*/
#endregion

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

using R = NumCIL.Range;

namespace nice
{

	using NumCIL.Double;
	using TArray = NumCIL.Double.NdArray;
	using TData = System.Double;

    public static class NiceSolverDouble
    {
		//private static readonly Utilities.Generator<TArray, TData> Generate = new Utilities.Generator<TArray, TData>();

		private static TData ConvertValue(object o) { return (TData)o; }

		//Gravity
		private static TData G = (TData)6.673e-11;

		// Solar mass
		public const TData SOLARMASS = (TData)1.98892e30;

		// Discrete Time units
		public const TData DT = (TData)1e12;


		private static void FillDiagonal(TArray a, TData val)
		{
			long d  = a.Shape.Dimensions[0].Length;
			a.Reshape(new NumCIL.Shape(new long[] { d }, 0, new long[] { d+1 })).Set(val);
		}

		private static void CalcForce(Bodies a, Bodies b)
		{
			var dx = b.x - a.x[R.NewAxis, R.All].Transposed;
			var dy = b.y - a.y[R.NewAxis, R.All].Transposed;
			var dz = b.z - a.z[R.NewAxis, R.All].Transposed;
			var pm = b.mass * a.mass[R.NewAxis, R.All].Transposed;

			if (a == b)
			{
				FillDiagonal(dx, 1);
				FillDiagonal(dy, 1);
				FillDiagonal(dz, 1);
				FillDiagonal(pm, 0);
			}

			var r = (dx.Pow(2) + dy.Pow(2) + dz.Pow(2)).Pow((TData)0.5);

			//In the below calc of the the forces the force of a body upon itself
			//becomes nan and thus destroys the data
			var Fx = G * pm / r.Pow(2) * (dx / r);
			var Fy = G * pm / r.Pow(2) * (dy / r);
			var Fz = G * pm / r.Pow(2) * (dz / r);

			//The diagonal nan numbers must be removed so that the force from a body
			//upon itself is zero
			if (a == b)
			{
				FillDiagonal(Fx, 0);
				FillDiagonal(Fy, 0);
				FillDiagonal(Fz, 0);
			}

			a.vx += Add.Reduce(Fx, 1) / a.mass * DT;
			a.vy += Add.Reduce(Fy, 1) / a.mass * DT;
			a.vz += Add.Reduce(Fz, 1) / a.mass * DT;
		}

		private static void Move(Galaxy galaxy)
		{
			CalcForce(galaxy.SolarSystem, galaxy.SolarSystem);
			CalcForce(galaxy.Asteroids, galaxy.SolarSystem);

			galaxy.SolarSystem.x += galaxy.SolarSystem.vx * DT;
			galaxy.SolarSystem.y += galaxy.SolarSystem.vy * DT;
			galaxy.SolarSystem.z += galaxy.SolarSystem.vz * DT;

			galaxy.Asteroids.x += galaxy.Asteroids.vx * DT;
			galaxy.Asteroids.y += galaxy.Asteroids.vy * DT;
			galaxy.Asteroids.z += galaxy.Asteroids.vz * DT;
		}


		public class Galaxy
		{
			public Bodies SolarSystem;
			public Bodies Asteroids;

			public Galaxy(long planets, long asteroids)
			{
				this.SolarSystem = new SolarSystem(planets);
				this.Asteroids = new Asteroids(asteroids);
			}
		}

		public abstract class Bodies
		{
			public const TData XMAX = (TData)1e18;
			public const TData YMAX = (TData)1e18;
			public const TData ZMAX = (TData)1e18;

			public TArray mass;
			public TArray x;
			public TArray y;
			public TArray z;
			public TArray vx;
			public TArray vy;
			public TArray vz;

			protected void Reset(long size)
			{
				this.x = Generate.Random(size);
				this.y = Generate.Random(size);
				this.z = Generate.Random(size) * (TData)0.01;

				var dist = 1f / Sqrt.Apply((this.x.Pow(2f) + this.y.Pow(2f) + this.z.Pow(2f)));
				dist = dist - (TData)(0.8f - (new Random().NextDouble() * 0.1f));

				this.x = XMAX * this.x * dist * Sign.Apply(.5f - Generate.Random(size));
				this.y = YMAX * this.y * dist * Sign.Apply(.5f - Generate.Random(size));
				this.z = ZMAX * this.z * dist * Sign.Apply(.5f - Generate.Random(size));

				var magv = Cirklev(this.x, this.y, this.z);
				var absangle = Atan.Apply(Abs.Apply(this.y / this.x));
				var thetav= (TData)(Math.PI/2) - absangle;

				this.vx = (TData)(-1) * Sign.Apply(this.y) * Cos.Apply(thetav) * magv;
				this.vy = Sign.Apply(this.x) * Sin.Apply(thetav) * magv;
				this.vz = Generate.Zeroes(size);
			}

			private TArray Cirklev(TArray rx, TArray ry, TArray rz)
			{
				var r2 = Sqrt.Apply(rx * rx + ry * ry + rz * rz);
				var numerator = (TData)((6.67e-11) * 1e6 * SOLARMASS);
				return Sqrt.Apply(numerator / r2);
			}
		}

		public class SolarSystem : Bodies
		{
			public SolarSystem(long size)
			{
				base.Reset(size);

				this.mass = (Generate.Random(size) * (TData)(SOLARMASS * 10)) + (TData)1e20;

				this.mass[0]= (TData)1e6 * SOLARMASS;
				this.x[0]= 0;
				this.y[0]= 0;
				this.z[0]= 0;
				this.vx[0]= 0;
				this.vy[0]= 0;
				this.vz[0]= 0;
			}
		}


		public class Asteroids : Bodies
		{
			public Asteroids(long size)
			{
				base.Reset(size);
				this.mass = (Generate.Random(size) * (TData)(SOLARMASS * 10)) + (TData)1e14;
			}
		}

		public static Galaxy Create(long planets, long asteroids)
		{
			return new Galaxy(planets, asteroids);
		}

		public static void Solve(Galaxy galaxy, long steps, bool image_output)
		{
			if (image_output)
				Render(galaxy, 0, steps);

			for (long step = 0; step < steps; step++)
			{
				Move(galaxy);

				if (image_output)
					Render(galaxy, step + 1, steps);
			}
		}

		private static void Render(Galaxy galaxy, long step, long steps)
		{
			var image_width = 1024;
			var image_height = 1024;

			var axz = galaxy.Asteroids.x;// / galaxy.Asteroids.z;
			var ayz = galaxy.Asteroids.y;// / galaxy.Asteroids.z;
			var sxz = galaxy.SolarSystem.x;// / galaxy.SolarSystem.z;
			var syz = galaxy.SolarSystem.y;// / galaxy.SolarSystem.z;

			Func<int, int, int, System.Drawing.Rectangle> inflateBox = (x, y, s) => new System.Drawing.Rectangle(new System.Drawing.Point(x - (s/2), y - (s/2)), new System.Drawing.Size(s, s));

			var mass_min = Min.Reduce(galaxy.SolarSystem.mass).Value[0];
			var mass_max = Max.Reduce(galaxy.SolarSystem.mass).Value[0];

			var rangex_max = Math.Max(Max.Reduce(axz).Value[0], Max.Reduce(sxz).Value[0]);
			var rangex_min = Math.Min(Min.Reduce(axz).Value[0], Min.Reduce(sxz).Value[0]);

			var rangey_max = Math.Max(Max.Reduce(ayz).Value[0], Max.Reduce(syz).Value[0]);
			var rangey_min = Math.Min(Min.Reduce(ayz).Value[0], Min.Reduce(syz).Value[0]);

			var rangex_diff = rangex_max - rangex_min;
			var rangey_diff = rangey_max - rangey_min;

			rangex_min = -Bodies.XMAX * 1.2f;
			rangey_min = -Bodies.YMAX * 1.2f;
			rangex_diff = Bodies.XMAX * 2.4f;
			rangey_diff = Bodies.YMAX * 2.4f;

			var sx = NumCIL.Int32.Min.Apply(image_width - 1, NumCIL.Int32.Max.Apply(0, ((NumCIL.Int32.NdArray)(((sxz - rangex_min) / rangex_diff) * (image_width - 1))))).AsArray();
			var sy = NumCIL.Int32.Min.Apply(image_height - 1, NumCIL.Int32.Max.Apply(0, ((NumCIL.Int32.NdArray)(((syz - rangey_min) / rangey_diff) * (image_height - 1))))).AsArray();

			var ax = NumCIL.Int32.Min.Apply(image_width - 1, NumCIL.Int32.Max.Apply(0, ((NumCIL.Int32.NdArray)(((axz - rangex_min) / rangex_diff) * (image_width - 1))))).AsArray();
			var ay = NumCIL.Int32.Min.Apply(image_height - 1, NumCIL.Int32.Max.Apply(0, ((NumCIL.Int32.NdArray)(((ayz - rangey_min) / rangey_diff) * (image_height - 1))))).AsArray();

			using (var image = new System.Drawing.Bitmap(image_width, image_height))
			{
				using (var gc = System.Drawing.Graphics.FromImage(image))
					gc.Clear(System.Drawing.Color.White);
					
				using (var gc = System.Drawing.Graphics.FromImage(image))
				{
					for (var s = 0; s < ax.Length; s++)
						gc.DrawEllipse(System.Drawing.Pens.DarkGray, inflateBox(ax[s], ay[s], 1));
					for (var s = 0; s < sx.Length; s++)
						gc.DrawEllipse(System.Drawing.Pens.Red, inflateBox(sx[s], sy[s], 2));
				}

				image.Save(string.Format("step-{0:0000}.png", step), System.Drawing.Imaging.ImageFormat.Png);
			}

			Console.WriteLine("Completed step {0} of {1}", step, steps);
		}
	}
}

namespace nice
{

	using NumCIL.Float;
	using TArray = NumCIL.Float.NdArray;
	using TData = System.Single;

    public static class NiceSolverSingle
    {
		//private static readonly Utilities.Generator<TArray, TData> Generate = new Utilities.Generator<TArray, TData>();

		private static TData ConvertValue(object o) { return (TData)o; }

		//Gravity
		private static TData G = (TData)6.673e-11;

		// Solar mass
		public const TData SOLARMASS = (TData)1.98892e30;

		// Discrete Time units
		public const TData DT = (TData)1e12;


		private static void FillDiagonal(TArray a, TData val)
		{
			long d  = a.Shape.Dimensions[0].Length;
			a.Reshape(new NumCIL.Shape(new long[] { d }, 0, new long[] { d+1 })).Set(val);
		}

		private static void CalcForce(Bodies a, Bodies b)
		{
			var dx = b.x - a.x[R.NewAxis, R.All].Transposed;
			var dy = b.y - a.y[R.NewAxis, R.All].Transposed;
			var dz = b.z - a.z[R.NewAxis, R.All].Transposed;
			var pm = b.mass * a.mass[R.NewAxis, R.All].Transposed;

			if (a == b)
			{
				FillDiagonal(dx, 1);
				FillDiagonal(dy, 1);
				FillDiagonal(dz, 1);
				FillDiagonal(pm, 0);
			}

			var r = (dx.Pow(2) + dy.Pow(2) + dz.Pow(2)).Pow((TData)0.5);

			//In the below calc of the the forces the force of a body upon itself
			//becomes nan and thus destroys the data
			var Fx = G * pm / r.Pow(2) * (dx / r);
			var Fy = G * pm / r.Pow(2) * (dy / r);
			var Fz = G * pm / r.Pow(2) * (dz / r);

			//The diagonal nan numbers must be removed so that the force from a body
			//upon itself is zero
			if (a == b)
			{
				FillDiagonal(Fx, 0);
				FillDiagonal(Fy, 0);
				FillDiagonal(Fz, 0);
			}

			a.vx += Add.Reduce(Fx, 1) / a.mass * DT;
			a.vy += Add.Reduce(Fy, 1) / a.mass * DT;
			a.vz += Add.Reduce(Fz, 1) / a.mass * DT;
		}

		private static void Move(Galaxy galaxy)
		{
			CalcForce(galaxy.SolarSystem, galaxy.SolarSystem);
			CalcForce(galaxy.Asteroids, galaxy.SolarSystem);

			galaxy.SolarSystem.x += galaxy.SolarSystem.vx * DT;
			galaxy.SolarSystem.y += galaxy.SolarSystem.vy * DT;
			galaxy.SolarSystem.z += galaxy.SolarSystem.vz * DT;

			galaxy.Asteroids.x += galaxy.Asteroids.vx * DT;
			galaxy.Asteroids.y += galaxy.Asteroids.vy * DT;
			galaxy.Asteroids.z += galaxy.Asteroids.vz * DT;
		}


		public class Galaxy
		{
			public Bodies SolarSystem;
			public Bodies Asteroids;

			public Galaxy(long planets, long asteroids)
			{
				this.SolarSystem = new SolarSystem(planets);
				this.Asteroids = new Asteroids(asteroids);
			}
		}

		public abstract class Bodies
		{
			public const TData XMAX = (TData)1e18;
			public const TData YMAX = (TData)1e18;
			public const TData ZMAX = (TData)1e18;

			public TArray mass;
			public TArray x;
			public TArray y;
			public TArray z;
			public TArray vx;
			public TArray vy;
			public TArray vz;

			protected void Reset(long size)
			{
				this.x = Generate.Random(size);
				this.y = Generate.Random(size);
				this.z = Generate.Random(size) * (TData)0.01;

				var dist = 1f / Sqrt.Apply((this.x.Pow(2f) + this.y.Pow(2f) + this.z.Pow(2f)));
				dist = dist - (TData)(0.8f - (new Random().NextDouble() * 0.1f));

				this.x = XMAX * this.x * dist * Sign.Apply(.5f - Generate.Random(size));
				this.y = YMAX * this.y * dist * Sign.Apply(.5f - Generate.Random(size));
				this.z = ZMAX * this.z * dist * Sign.Apply(.5f - Generate.Random(size));

				var magv = Cirklev(this.x, this.y, this.z);
				var absangle = Atan.Apply(Abs.Apply(this.y / this.x));
				var thetav= (TData)(Math.PI/2) - absangle;

				this.vx = (TData)(-1) * Sign.Apply(this.y) * Cos.Apply(thetav) * magv;
				this.vy = Sign.Apply(this.x) * Sin.Apply(thetav) * magv;
				this.vz = Generate.Zeroes(size);
			}

			private TArray Cirklev(TArray rx, TArray ry, TArray rz)
			{
				var r2 = Sqrt.Apply(rx * rx + ry * ry + rz * rz);
				var numerator = (TData)((6.67e-11) * 1e6 * SOLARMASS);
				return Sqrt.Apply(numerator / r2);
			}
		}

		public class SolarSystem : Bodies
		{
			public SolarSystem(long size)
			{
				base.Reset(size);

				this.mass = (Generate.Random(size) * (TData)(SOLARMASS * 10)) + (TData)1e20;

				this.mass[0]= (TData)1e6 * SOLARMASS;
				this.x[0]= 0;
				this.y[0]= 0;
				this.z[0]= 0;
				this.vx[0]= 0;
				this.vy[0]= 0;
				this.vz[0]= 0;
			}
		}


		public class Asteroids : Bodies
		{
			public Asteroids(long size)
			{
				base.Reset(size);
				this.mass = (Generate.Random(size) * (TData)(SOLARMASS * 10)) + (TData)1e14;
			}
		}

		public static Galaxy Create(long planets, long asteroids)
		{
			return new Galaxy(planets, asteroids);
		}

		public static void Solve(Galaxy galaxy, long steps, bool image_output)
		{
			if (image_output)
				Render(galaxy, 0, steps);

			for (long step = 0; step < steps; step++)
			{
				Move(galaxy);

				if (image_output)
					Render(galaxy, step + 1, steps);
			}
		}

		private static void Render(Galaxy galaxy, long step, long steps)
		{
			var image_width = 1024;
			var image_height = 1024;

			var axz = galaxy.Asteroids.x;// / galaxy.Asteroids.z;
			var ayz = galaxy.Asteroids.y;// / galaxy.Asteroids.z;
			var sxz = galaxy.SolarSystem.x;// / galaxy.SolarSystem.z;
			var syz = galaxy.SolarSystem.y;// / galaxy.SolarSystem.z;

			Func<int, int, int, System.Drawing.Rectangle> inflateBox = (x, y, s) => new System.Drawing.Rectangle(new System.Drawing.Point(x - (s/2), y - (s/2)), new System.Drawing.Size(s, s));

			var mass_min = Min.Reduce(galaxy.SolarSystem.mass).Value[0];
			var mass_max = Max.Reduce(galaxy.SolarSystem.mass).Value[0];

			var rangex_max = Math.Max(Max.Reduce(axz).Value[0], Max.Reduce(sxz).Value[0]);
			var rangex_min = Math.Min(Min.Reduce(axz).Value[0], Min.Reduce(sxz).Value[0]);

			var rangey_max = Math.Max(Max.Reduce(ayz).Value[0], Max.Reduce(syz).Value[0]);
			var rangey_min = Math.Min(Min.Reduce(ayz).Value[0], Min.Reduce(syz).Value[0]);

			var rangex_diff = rangex_max - rangex_min;
			var rangey_diff = rangey_max - rangey_min;

			rangex_min = -Bodies.XMAX * 1.2f;
			rangey_min = -Bodies.YMAX * 1.2f;
			rangex_diff = Bodies.XMAX * 2.4f;
			rangey_diff = Bodies.YMAX * 2.4f;

			var sx = NumCIL.Int32.Min.Apply(image_width - 1, NumCIL.Int32.Max.Apply(0, ((NumCIL.Int32.NdArray)(((sxz - rangex_min) / rangex_diff) * (image_width - 1))))).AsArray();
			var sy = NumCIL.Int32.Min.Apply(image_height - 1, NumCIL.Int32.Max.Apply(0, ((NumCIL.Int32.NdArray)(((syz - rangey_min) / rangey_diff) * (image_height - 1))))).AsArray();

			var ax = NumCIL.Int32.Min.Apply(image_width - 1, NumCIL.Int32.Max.Apply(0, ((NumCIL.Int32.NdArray)(((axz - rangex_min) / rangex_diff) * (image_width - 1))))).AsArray();
			var ay = NumCIL.Int32.Min.Apply(image_height - 1, NumCIL.Int32.Max.Apply(0, ((NumCIL.Int32.NdArray)(((ayz - rangey_min) / rangey_diff) * (image_height - 1))))).AsArray();

			using (var image = new System.Drawing.Bitmap(image_width, image_height))
			{
				using (var gc = System.Drawing.Graphics.FromImage(image))
					gc.Clear(System.Drawing.Color.White);
					
				using (var gc = System.Drawing.Graphics.FromImage(image))
				{
					for (var s = 0; s < ax.Length; s++)
						gc.DrawEllipse(System.Drawing.Pens.DarkGray, inflateBox(ax[s], ay[s], 1));
					for (var s = 0; s < sx.Length; s++)
						gc.DrawEllipse(System.Drawing.Pens.Red, inflateBox(sx[s], sy[s], 2));
				}

				image.Save(string.Format("step-{0:0000}.png", step), System.Drawing.Imaging.ImageFormat.Png);
			}

			Console.WriteLine("Completed step {0} of {1}", step, steps);
		}
	}
}

Quasicrystal

Python Numpy

# -*- coding: utf-8 -*-
from benchpress.benchmarks import util
import numpy as np

bench = util.Benchmark("Quasi Crystal simulation", "k*stripes*image_size*iterations")


def main():
    k = bench.args.size[0]  # number of plane waves
    stripes = bench.args.size[1]  # number of stripes per wave
    N = bench.args.size[2]  # image size in pixels
    ite = bench.args.size[3]  # iterations

    phases = np.arange(0, 2 * np.pi, 2 * np.pi / ite)
    image = np.empty((N, N), dtype=bench.dtype)
    d = np.arange(-N / 2, N / 2, dtype=bench.dtype)

    xv, yv = np.meshgrid(d, d)
    theta = np.arctan2(yv, xv)
    r = np.log(np.sqrt(xv * xv + yv * yv))
    r[np.isinf(r) == True] = 0

    tcos = theta * np.cos(np.arange(0, np.pi, np.pi / k))[:, np.newaxis, np.newaxis]
    rsin = r * np.sin(np.arange(0, np.pi, np.pi / k))[:, np.newaxis, np.newaxis]
    inner = (tcos - rsin) * stripes

    cinner = np.cos(inner)
    sinner = np.sin(inner)

    bench.start()
    for phase in phases:
        image[:] = np.sum(cinner * np.cos(phase) - sinner * np.sin(phase), axis=0) + k
        bench.flush()
    bench.stop()
    bench.pprint()


if __name__ == "__main__":
    main()

Reactiondiffusion

Csharp Numcil

#region Copyright
/*
This file is part of Bohrium and copyright (c) 2012 the Bohrium:
team <http://www.bh107.org>.

Bohrium is free software: you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as 
published by the Free Software Foundation, either version 3 
of the License, or (at your option) any later version.

Bohrium is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the 
GNU Lesser General Public License along with Bohrium. 

If not, see <http://www.gnu.org/licenses/>.
*/
#endregion

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

using R = NumCIL.Range;

namespace reactiondiffusion
{

	using NumCIL.Double;
	using TArray = NumCIL.Double.NdArray;
	using TData = System.Double;

	public static class ReactionDiffusionSolverDouble
	{
		private static TArray Laplacian(TArray Z, TData dx)
		{
			var Ztop = Z[R.R(0, -2), R.R(1, -1)];
			var Zleft = Z[R.R(1, -1), R.R(0, -2)];
			var Zbottom = Z[R.R(2, 0), R.R(1, -1)];
			var Zright = Z[R.R(1, -1), R.R(2, 0)];
			var Zcenter = Z[R.R(1, -1), R.R(1, -1)];

			return (Ztop + Zleft + Zbottom + Zright - 4 * Zcenter) / (TData)Math.Pow(dx, 2);
		}

		public static void UpdateBondaries(TArray Z)
		{
			Z[R.El(0), R.All] = Z[R.El(1), R.All];
			Z[R.El(-1), R.All] = Z[R.El(-2), R.All];
			Z[R.All, R.El(0)] = Z[R.All, R.El(1)];
			Z[R.All, R.El(-1)] = Z[R.All, R.El(-2)];
		}

		public class Data
		{
			public TArray U;
			public TArray V;
			public long Size;
		}

		public static Data Create(long size)
		{
			return new Data()
			{
				U = Generate.Random(new long[] { size, size }),
				V = Generate.Random(new long[] { size, size }),
				Size = size
			};
		}

		public static TArray Solve(Data data, long iterations, bool image_output)
		{
			var a = (TData)2.8e-4;
			var b = (TData)5e-3;
			var tau = (TData)(0.1);
			var k = (TData)(-0.005);

			var dx = (TData)(2.0/data.Size);

			var T = (TData)10.0; // Total time
			var dt = (TData)(.9 * Math.Pow(dx,2) / 2);  // time step
			var n = (long)(T/dt);

			var U = data.U;
			var V = data.V;

			for(var step = 0L; step < iterations; step++)
			{
				if (image_output)
					Plot(data, step, iterations);

				// We compute the Laplacian of u and v.
				var deltaU = Laplacian(U, dx);
				var deltaV = Laplacian(V, dx);

				// We take the values of u and v inside the grid.
				var Uc = U[R.R(1,-1),R.R(1,-1)];
				var Vc = V[R.R(1,-1),R.R(1,-1)];

				// We update the variables.
				U[R.R(1,-1),R.R(1,-1)] = Uc + dt * (a * deltaU + Uc - Uc.Pow(3) - Vc + k);
				V[R.R(1,-1),R.R(1,-1)] = Vc + dt * (b * deltaV + Uc - Vc) / tau;
			
				// Neumann conditions: derivatives at the edges are null.
				UpdateBondaries(U);
				UpdateBondaries(V);


			}

			if (image_output)
				Plot(data, iterations, iterations);


			return U;
		}

		public static TData Sync(Data data)
		{
			return data.U.Value[0];
		}

		public static void Plot(Data data, long step, long steps)
		{
			var cm = Utilities.Render.ColorMap(System.Drawing.Color.Coral);
			var nm = Utilities.Render.Normalize<TData>(-1, 1);
			Utilities.Render.Plot<TData>(string.Format("step-{0:0000}.png", step), data.U, (x, y, v) => cm(nm(v)));
		}
	}
}

namespace reactiondiffusion
{

	using NumCIL.Float;
	using TArray = NumCIL.Float.NdArray;
	using TData = System.Single;

	public static class ReactionDiffusionSolverSingle
	{
		private static TArray Laplacian(TArray Z, TData dx)
		{
			var Ztop = Z[R.R(0, -2), R.R(1, -1)];
			var Zleft = Z[R.R(1, -1), R.R(0, -2)];
			var Zbottom = Z[R.R(2, 0), R.R(1, -1)];
			var Zright = Z[R.R(1, -1), R.R(2, 0)];
			var Zcenter = Z[R.R(1, -1), R.R(1, -1)];

			return (Ztop + Zleft + Zbottom + Zright - 4 * Zcenter) / (TData)Math.Pow(dx, 2);
		}

		public static void UpdateBondaries(TArray Z)
		{
			Z[R.El(0), R.All] = Z[R.El(1), R.All];
			Z[R.El(-1), R.All] = Z[R.El(-2), R.All];
			Z[R.All, R.El(0)] = Z[R.All, R.El(1)];
			Z[R.All, R.El(-1)] = Z[R.All, R.El(-2)];
		}

		public class Data
		{
			public TArray U;
			public TArray V;
			public long Size;
		}

		public static Data Create(long size)
		{
			return new Data()
			{
				U = Generate.Random(new long[] { size, size }),
				V = Generate.Random(new long[] { size, size }),
				Size = size
			};
		}

		public static TArray Solve(Data data, long iterations, bool image_output)
		{
			var a = (TData)2.8e-4;
			var b = (TData)5e-3;
			var tau = (TData)(0.1);
			var k = (TData)(-0.005);

			var dx = (TData)(2.0/data.Size);

			var T = (TData)10.0; // Total time
			var dt = (TData)(.9 * Math.Pow(dx,2) / 2);  // time step
			var n = (long)(T/dt);

			var U = data.U;
			var V = data.V;

			for(var step = 0L; step < iterations; step++)
			{
				if (image_output)
					Plot(data, step, iterations);

				// We compute the Laplacian of u and v.
				var deltaU = Laplacian(U, dx);
				var deltaV = Laplacian(V, dx);

				// We take the values of u and v inside the grid.
				var Uc = U[R.R(1,-1),R.R(1,-1)];
				var Vc = V[R.R(1,-1),R.R(1,-1)];

				// We update the variables.
				U[R.R(1,-1),R.R(1,-1)] = Uc + dt * (a * deltaU + Uc - Uc.Pow(3) - Vc + k);
				V[R.R(1,-1),R.R(1,-1)] = Vc + dt * (b * deltaV + Uc - Vc) / tau;
			
				// Neumann conditions: derivatives at the edges are null.
				UpdateBondaries(U);
				UpdateBondaries(V);


			}

			if (image_output)
				Plot(data, iterations, iterations);


			return U;
		}

		public static TData Sync(Data data)
		{
			return data.U.Value[0];
		}

		public static void Plot(Data data, long step, long steps)
		{
			var cm = Utilities.Render.ColorMap(System.Drawing.Color.Coral);
			var nm = Utilities.Render.Normalize<TData>(-1, 1);
			Utilities.Render.Plot<TData>(string.Format("step-{0:0000}.png", step), data.U, (x, y, v) => cm(nm(v)));
		}
	}
}

Rosenbrock

Python Numpy

from __future__ import print_function
from benchpress.benchmarks import util
import numpy as np

bench = util.Benchmark("Rosenbrock", "size*iterations")


def rosen(x):
    return np.sum(
        100.0 * (x[1:] - x[:-1] ** 2.0) ** 2.0 + \
        (1 - x[:-1]) ** 2.0
    )


def main():
    N, T = bench.args.size  # Grab command-line arguments
    dataset = np.arange(N, dtype=bench.dtype) / bench.dtype(N)

    bench.start()  # Sample wall-clock start
    res = 0.0
    for _ in range(0, T):  # Do T trials of..
        res += rosen(dataset)  # ..executing rosenbrock.
        bench.flush()
    res /= T
    bench.stop()  # Sample wall-clock stop
    bench.pprint()  # Print elapsed wall-clock etc.


if __name__ == "__main__":
    main()

C99 Seq

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <bp_util.h>

double rosenbrock(int nelements, double* x)
{
    double sum = 0.0;
    for(int i=0; i<nelements-2; ++i) {
        sum += 100.0*pow((x[i+1] - pow(x[i], 2)), 2) + pow((1-x[i]), 2);
    }
    return sum;
}

int main(int argc, char* argv[])
{
    bp_util_type bp = bp_util_create(argc, argv, 2);// Grab arguments
    if (bp.args.has_error) {
        return 1;
    }
    const int nelements = bp.args.sizes[0];
    const int trials = bp.args.sizes[1];

    // Create the pseudo-data
    double* dataset = (double*)malloc(sizeof(double)*nelements);

    for(int i=0; i<nelements; ++i) {
        dataset[i] = i/(double)nelements;
    }

    bp.timer_start();                               // Start timer
    double res = 0.0;
    for(int i=0; i<trials; ++i) {
        res += rosenbrock(nelements, dataset);      // Run benchmark
    }
    res /= trials;
    bp.timer_stop();                                // Stop timer
    
    bp.print("rosenbrock(c99_seq)");                // Print results..
    if (bp.args.verbose) {                          // ..and value.
        printf("Result = %f\n", res);
    }

    return 0;
}

Cpp11 Omp

#include <iostream>
#include <iomanip>
#include <cmath>
#include <omp.h>
#include <bp_util.h>

using namespace std;

double rosenbrock(int nelements, double* x)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for(int i=0; i<nelements-2; ++i) {
        sum += 100.0*pow((x[i+1] - pow(x[i], 2)), 2) + pow((1-x[i]), 2);
    }
    return sum;
}

int main(int argc, char* argv[])
{
    bp_util_type bp = bp_util_create(argc, argv, 2);// Grab arguments
    if (bp.args.has_error) {
        return 1;
    }
    const int nelements = bp.args.sizes[0];
    const int trials = bp.args.sizes[1];

    // Create the pseudo-data
    double* dataset = (double*)malloc(sizeof(double)*nelements);

    #pragma omp parallel for
    for(int i=0; i<nelements; ++i) {
        dataset[i] = i/(double)nelements;
    }

    bp.timer_start();                               // Start timer
    double res = 0.0;
    for(int i=0; i<trials; ++i) {
        res += rosenbrock(nelements, dataset);      // Run benchmark
    }
    res /= trials;
    bp.timer_stop();                                // Stop timer
    bp.print("rosenbrock(cpp11_omp)");
    if (bp.args.verbose) {                          // ..and value.
        cout << fixed << setprecision(11)
			 << "Result = " << res << endl;
    }

    return 0;
}

Cpp11 Seq

#include <iostream>
#include <iomanip>
#include <cmath>
#include <bp_util.h>

using namespace std;

double rosenbrock(int nelements, double* x)
{
    double sum = 0.0;
    for(int i=0; i<nelements-2; ++i) {
        sum += 100.0*pow((x[i+1] - pow(x[i], 2)), 2) + pow((1-x[i]), 2);
    }
    return sum;
}

int main(int argc, char* argv[])
{
    bp_util_type bp = bp_util_create(argc, argv, 2);// Grab arguments
    if (bp.args.has_error) {
        return 1;
    }
    const int nelements = bp.args.sizes[0];
    const int trials = bp.args.sizes[1];

    // Create the pseudo-data
    double* dataset = (double*)malloc(sizeof(double)*nelements);

    for(int i=0; i<nelements; ++i) {
        dataset[i] = i/(double)nelements;
    }

    bp.timer_start();                               // Start timer
    double res = 0.0;
    for(int i=0; i<trials; ++i) {
        res += rosenbrock(nelements, dataset);      // Run benchmark
    }
    res /= trials;
    bp.timer_stop();                                // Stop timer
    bp.print("rosenbrock(cpp11_seq)");
    if (bp.args.verbose) {                          // ..and value.
        cout << fixed << setprecision(11)
			 << "Result = " << res << endl;
    }

    return 0;
}

Shallow Water

Python Numpy

from __future__ import print_function

"""
Shallow Water
-------------

So what does this code example illustrate?

Adapted from: http://people.sc.fsu.edu/~jburkardt/m_src/shallow_water_2d/
"""
import numpy as np
from benchpress.benchmarks import util

bench = util.Benchmark("Shallow Water", "height*width*iterations")
g = 9.80665  # gravitational acceleration


def droplet(height, width, data_type=np.float32):
    """Generate grid of droplets"""

    x = np.array(np.linspace(-1, 1, num=width, endpoint=True), dtype=data_type)
    y = np.array(np.linspace(-1, 1, num=width, endpoint=True), dtype=data_type)
    (xx, yy) = np.meshgrid(x, y)
    droplet = height * np.exp(-5 * (xx ** 2 + yy ** 2))
    return droplet


def model(height, width, dtype=np.float32):
    assert height >= 16
    assert width >= 16
    m = np.ones((height, width), dtype=dtype)
    D = droplet(8, 8)  # simulate a water drop
    droploc = height // 4
    (dropx, dropy) = D.shape
    m[droploc:droploc + dropx, droploc:droploc + dropy] += D
    droploc = height // 2
    (dropx, dropy) = D.shape
    m[droploc:droploc + dropx, droploc:droploc + dropy] += D
    return {"H": m, "U": np.zeros_like(m), "V": np.zeros_like(m)}


def simulate(state, timesteps):
    # FLOP count: i*(12*s + 4*s**2 + 14*s**2 + 9*s**2 + 4*s**2 + 9*s**2 + 14*s**2 + 6*s**2 + 19*s**2 + 19*s**2)
    # where s is size and i is iterations
    def loop_body(H, U, V, dt=0.02, dx=1.0, dy=1.0):
        # Reflecting boundary conditions
        H[:, 0] = H[:, 1];
        U[:, 0] = U[:, 1];
        V[:, 0] = -V[:, 1]
        H[:, -1] = H[:, -2];
        U[:, -1] = U[:, -2];
        V[:, -1] = -V[:, -2]
        H[0, :] = H[1, :];
        U[0, :] = -U[1, :];
        V[0, :] = V[1, :]
        H[-1, :] = H[-2, :];
        U[-1, :] = -U[-2, :];
        V[-1, :] = V[-2, :]

        # First half step

        # height
        Hx = (H[1:, 1:-1] + H[:-1, 1:-1]) / 2 - dt / (2 * dx) * (U[1:, 1:-1] - U[:-1, 1:-1])

        # x momentum
        Ux = (U[1:, 1:-1] + U[:-1, 1:-1]) / 2 - \
             dt / (2 * dx) * ((U[1:, 1:-1] ** 2 / H[1:, 1:-1] + g / 2 * H[1:, 1:-1] ** 2) -
                              (U[:-1, 1:-1] ** 2 / H[:-1, 1:-1] + g / 2 * H[:-1, 1:-1] ** 2))

        # y momentum
        Vx = (V[1:, 1:-1] + V[:-1, 1:-1]) / 2 - \
             dt / (2 * dx) * ((U[1:, 1:-1] * V[1:, 1:-1] / H[1:, 1:-1]) -
                              (U[:-1, 1:-1] * V[:-1, 1:-1] / H[:-1, 1:-1]))

        # height
        Hy = (H[1:-1, 1:] + H[1:-1, :-1]) / 2 - dt / (2 * dy) * (V[1:-1, 1:] - V[1:-1, :-1])

        # x momentum
        Uy = (U[1:-1, 1:] + U[1:-1, :-1]) / 2 - \
             dt / (2 * dy) * ((V[1:-1, 1:] * U[1:-1, 1:] / H[1:-1, 1:]) -
                              (V[1:-1, :-1] * U[1:-1, :-1] / H[1:-1, :-1]))
        # y momentum
        Vy = (V[1:-1, 1:] + V[1:-1, :-1]) / 2 - \
             dt / (2 * dy) * ((V[1:-1, 1:] ** 2 / H[1:-1, 1:] + g / 2 * H[1:-1, 1:] ** 2) -
                              (V[1:-1, :-1] ** 2 / H[1:-1, :-1] + g / 2 * H[1:-1, :-1] ** 2))

        # Second half step

        # height
        H[1:-1, 1:-1] -= (dt / dx) * (Ux[1:, :] - Ux[:-1, :]) + (dt / dy) * (Vy[:, 1:] - Vy[:, :-1])

        # x momentum
        U[1:-1, 1:-1] -= (dt / dx) * ((Ux[1:, :] ** 2 / Hx[1:, :] + g / 2 * Hx[1:, :] ** 2) -
                                      (Ux[:-1, :] ** 2 / Hx[:-1, :] + g / 2 * Hx[:-1, :] ** 2)) + \
                         (dt / dy) * ((Vy[:, 1:] * Uy[:, 1:] / Hy[:, 1:]) -
                                      (Vy[:, :-1] * Uy[:, :-1] / Hy[:, :-1]))
        # y momentum
        V[1:-1, 1:-1] -= (dt / dx) * ((Ux[1:, :] * Vx[1:, :] / Hx[1:, :]) -
                                      (Ux[:-1, :] * Vx[:-1, :] / Hx[:-1, :])) + \
                         (dt / dy) * ((Vy[:, 1:] ** 2 / Hy[:, 1:] + g / 2 * Hy[:, 1:] ** 2) -
                                      (Vy[:, :-1] ** 2 / Hy[:, :-1] + g / 2 * Hy[:, :-1] ** 2))
        bench.plot_surface(H, "3d", 0, 0, 5.5)

    bench.do_while(loop_body, timesteps, state['H'], state['U'], state['V'])


def main():
    H = bench.args.size[0]
    W = bench.args.size[1]
    I = bench.args.size[2]

    state = bench.load_data()
    if state is None:
        state = model(H, W, dtype=bench.dtype)

    bench.start()
    simulate(state, I)
    bench.stop()
    bench.save_data(state)
    bench.pprint()


if __name__ == "__main__":
    main()

C99 Seq

// Adapted from: http://people.sc.fsu.edu/~jburkardt/m_src/shallow_water_2d/
// Saved images may be converted into an animated gif with:
// convert   -delay 20   -loop 0   swater*.png   swater.gif


#include <math.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <bp_util.h>

#define iH(y,x) (H[(x)+(y)*(n+2)])
#define iU(y,x) (U[(x)+(y)*(n+2)])
#define iV(y,x) (V[(x)+(y)*(n+2)])

#define iHx(y,x) (Hx[(x)+(y)*(n+1)])
#define iUx(y,x) (Ux[(x)+(y)*(n+1)])
#define iVx(y,x) (Vx[(x)+(y)*(n+1)])

#define iHy(y,x) (Hy[(x)+(y)*(n+1)])
#define iUy(y,x) (Uy[(x)+(y)*(n+1)])
#define iVy(y,x) (Vy[(x)+(y)*(n+1)])

int main (int argc, char **argv)
{
    bp_util_type bp = bp_util_create(argc, argv, 3);
    if (bp.args.has_error) {
        return 1;
    }
    const int n = bp.args.sizes[0];
    const int other_n = bp.args.sizes[1];
    const int T = bp.args.sizes[2];

    if (n != other_n) {
        fprintf(stderr, "Implementation only supports quadratic grid.\n");
        exit(0);
    }

    const double g   = 9.8;       // gravitational constant
    const double dt  = 0.02;      // hardwired timestep
    const double dx  = 1.0;
    const double dy  = 1.0;
    const int droploc = n/4;

    double *H = (double*) malloc((n+2)*(n+2)*sizeof(double));
    double *U = (double*) malloc((n+2)*(n+2)*sizeof(double));
    double *V = (double*) malloc((n+2)*(n+2)*sizeof(double));

    double *Hx = (double*) malloc((n+1)*(n+1)*sizeof(double));
    double *Ux = (double*) malloc((n+1)*(n+1)*sizeof(double));
    double *Vx = (double*) malloc((n+1)*(n+1)*sizeof(double));

    double *Hy = (double*) malloc((n+1)*(n+1)*sizeof(double));
    double *Uy = (double*) malloc((n+1)*(n+1)*sizeof(double));
    double *Vy = (double*) malloc((n+1)*(n+1)*sizeof(double));


    for(int i=0; i<n+2; i++)
        for(int j=0; j<n+2; j++)
        {
          iH(i,j)=0.1;
          iU(i,j)=0.1;
          iV(i,j)=0.1;
        }


    for(int i=0; i<n+1; i++)
        for(int j=0; j<n+1; j++)
        {
          iHx(i,j)=0.1;
          iUx(i,j)=0.1;
          iVx(i,j)=0.1;

          iHy(i,j)=0.1;
          iUy(i,j)=0.1;
          iVy(i,j)=0.1;
        }

    iH(droploc,droploc) += 5.0;

    
    bp.timer_start();

    for(int iter=0; iter < T; iter++)
    {
        // Reflecting boundary conditions
        for(int i=0; i<n+2; i++)
        {
            iH(i,0) = iH(i,1)   ; iU(i,0) = iU(i,1)     ; iV(i,0) = -iV(i,1);
            iH(i,n+1) = iH(i,n) ; iU(i,n+1) = iU(i,n)   ; iV(i,n+1) = -iV(i,n);
            iH(0,i) = iH(1,i)   ; iU(0,i) = -iU(1,i)    ; iV(0,i) = iV(1,i);
            iH(n+1,i) = iH(n,i) ; iU(n+1,i) = -iU(n,i)  ; iV(n+1,i) = iV(n,i);
        }
        //
        // First half step
        //
        for(int i=0; i<n+1; i++)
        {
            for(int j=0; j<n; j++)
            {

                // height
                iHx(i,j) = (iH(i+1,j+1)+iH(i,j+1))/2 - dt/(2*dx)*(iU(i+1,j+1)-iU(i,j+1));

                // x momentum
                iUx(i,j) = (iU(i+1,j+1)+iU(i,j+1))/2 -          \
                dt/(2*dx)*((pow(iU(i+1,j+1),2)/iH(i+1,j+1) +	\
                          g/2*pow(iH(i+1,j+1),2)) -		\
                         (pow(iU(i,j+1),2)/iH(i,j+1) +	        \
                          g/2*pow(iH(i,j+1),2)));

                // y momentum
                iVx(i,j) = (iV(i+1,j+1)+iV(i,j+1))/2 -          \
                      dt/(2*dx)*((iU(i+1,j+1) *                 \
                                  iV(i+1,j+1)/iH(i+1,j+1)) -    \
                                 (iU(i,j+1) *                   \
                                  iV(i,j+1)/iH(i,j+1)));
            }
        }

        for(int i=0; i<n; i++)
        {
            for(int j=0; j<n+1; j++)
            {
                //height
                iHy(i,j) = (iH(i+1,j+1)+iH(i+1,j))/2 - dt/(2*dy)*(iV(i+1,j+1)-iV(i+1,j));

                //x momentum
                iUy(i,j) = (iU(i+1,j+1)+iU(i+1,j))/2 -	   \
                dt/(2*dy)*((iV(i+1,j+1) *                  \
                           iU(i+1,j+1)/iH(i+1,j+1)) -	   \
                                (iV(i+1,j) *               \
                                 iU(i+1,j)/iH(i+1,j)));

                //y momentum
                iVy(i,j) = (iV(i+1,j+1)+iV(i+1,j))/2 -	\
                dt/(2*dy)*((pow(iV(i+1,j+1),2)/iH(i+1,j+1) +	\
                           g/2*pow(iH(i+1,j+1),2)) -	\
                          (pow(iV(i+1,j),2)/iH(i+1,j) +	\
                           g/2*pow(iH(i+1,j),2)));
            }
        }
        //
        // Second half step
        //

        for(int i=1; i<n+1; i++)
        {
            for(int j=1; j<n+1; j++)
            {
                //height
                iH(i,j) -= (dt/dx)*(iUx(i,j-1)-iUx(i-1,j-1)) - (dt/dy)*(iVy(i-1,j)-iVy(i-1,j-1));

                // x momentum
                iU(i,j) -= (dt/dx)*((pow(iUx(i,j-1),2)/iHx(i,j-1) + g/2*pow(iHx(i,j-1),2)) -        \
                                    (pow(iUx(i-1,j-1),2)/iHx(i-1,j-1) + g/2*pow(iHx(i-1,j-1),2))) - \
                           (dt/dy)*((iVy(i-1,j) * iUy(i-1,j)/iHy(i-1,j)) -
                                    (iVy(i-1,j-1) * iUy(i-1,j-1)/iHy(i-1,j-1)));

                // y momentum    - score
                iV(i,j) -= (dt/dx)*((iUx(i,j-1) * iVx(i,j-1)/iHx(i,j-1)) -                         \
                                    (iUx(i-1,j-1)*iVx(i-1,j-1)/iHx(i-1,j-1))) -                    \
                           (dt/dy)*((pow(iVy(i-1,j),2)/iHy(i-1,j) + g/2*pow(iHy(i-1,j),2)) -       \
                                    (pow(iVy(i-1,j-1),2)/iHy(i-1,j-1) + g/2*pow(iHy(i-1,j-1),2)));

            }
        }
    }

    /* NumPy implementation does not do this!
    // res = numpy.add.reduce(numpy.add.reduce(H / n))
    double res = 0.0;
    for(int i=0;i<n+2;i++)
        for(int j=0;j<n+2;j++)
            res+=iH(i,j)/n;
    */
    bp.timer_stop();
    bp.print("shallow_water(c99_seq)");
}

Cpp11 Boost

// Adapted from: http://people.sc.fsu.edu/~jburkardt/m_src/shallow_water_2d/
// Saved images may be converted into an animated gif with:
// convert   -delay 20   -loop 0   swater*.png   swater.gif
#include <math.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <bp_util.h>

#include <boost/multi_array.hpp>

int main (int argc, char **argv)
{
    bp_util_type bp = bp_util_create(argc, argv, 3);
    if (bp.args.has_error) {
        return 1;
    }
    const int n         = bp.args.sizes[0];
    const int other_n   = bp.args.sizes[1];
    const int T         = bp.args.sizes[2];

    if (n != other_n) {
        fprintf(stderr, "Implementation only supports quadratic grid.\n");
        exit(0);
    }

    const double g   = 9.8;       // gravitational constant
    const double dt  = 0.02;      // hardwired timestep
    const double dx  = 1.0;
    const double dy  = 1.0;
    const int droploc = n/4;

    typedef boost::multi_array<double, 2> array_type;

    array_type H(boost::extents[n+2][n+2]);
    array_type U(boost::extents[n+2][n+2]);
    array_type V(boost::extents[n+2][n+2]);

    array_type Hx(boost::extents[n+1][n+1]);
    array_type Ux(boost::extents[n+1][n+1]);
    array_type Vx(boost::extents[n+1][n+1]);
    array_type Hy(boost::extents[n+1][n+1]);
    array_type Uy(boost::extents[n+1][n+1]);
    array_type Vy(boost::extents[n+1][n+1]);


    for(int i=0; i<n+2; i++)
        for(int j=0; j<n+2; j++)
        {
          H[i][j]=0.1;
          U[i][j]=0.1;
          V[i][j]=0.1;
        }


    for(int i=0; i<n+1; i++)
        for(int j=0; j<n+1; j++)
        {
          Hx[i][j]=0.1;
          Ux[i][j]=0.1;
          Vx[i][j]=0.1;

          Hy[i][j]=0.1;
          Uy[i][j]=0.1;
          Vy[i][j]=0.1;
        }

    H[droploc][droploc] += 5.0;

    bp.timer_start();

    for(int iter=0; iter < T; iter++)
    {
        // Reflecting boundary conditions
        for(int i=0; i<n+2; i++)
        {
            H[i][0] = H[i][1]   ; U[i][0] = U[i][1]     ; V[i][0] = -V[i][1];
            H[i][n+1] = H[i][n] ; U[i][n+1] = U[i][n]   ; V[i][n+1] = -V[i][n];
            H[0][i] = H[1][i]   ; U[0][i] = -U[1][i]    ; V[0][i] = V[1][i];
            H[n+1][i] = H[n][i] ; U[n+1][i] = -U[n][i]  ; V[n+1][i] = V[n][i];
        }
        //
        // First half step
        //
        for(int i=0; i<n+1; i++)
        {
            for(int j=0; j<n; j++)
            {

                // height
                Hx[i][j] = (H[i+1][j+1]+H[i][j+1])/2 - dt/(2*dx)*(U[i+1][j+1]-U[i][j+1]);

                // x momentum
                Ux[i][j] = (U[i+1][j+1]+U[i][j+1])/2 -          \
                dt/(2*dx)*((pow(U[i+1][j+1],2)/H[i+1][j+1] +	\
                          g/2*pow(H[i+1][j+1],2)) -		\
                         (pow(U[i][j+1],2)/H[i][j+1] +	        \
                          g/2*pow(H[i][j+1],2)));

                // y momentum
                Vx[i][j] = (V[i+1][j+1]+V[i][j+1])/2 -          \
                      dt/(2*dx)*((U[i+1][j+1] *                 \
                                  V[i+1][j+1]/H[i+1][j+1]) -    \
                                 (U[i][j+1] *                   \
                                  V[i][j+1]/H[i][j+1]));
            }
        }

        for(int i=0; i<n; i++)
        {
            for(int j=0; j<n+1; j++)
            {
                //height
                Hy[i][j] = (H[i+1][j+1]+H[i+1][j])/2 - dt/(2*dy)*(V[i+1][j+1]-V[i+1][j]);

                //x momentum
                Uy[i][j] = (U[i+1][j+1]+U[i+1][j])/2 -	   \
                dt/(2*dy)*((V[i+1][j+1] *                  \
                           U[i+1][j+1]/H[i+1][j+1]) -	   \
                                (V[i+1][j] *               \
                                 U[i+1][j]/H[i+1][j]));

                //y momentum
                Vy[i][j] = (V[i+1][j+1]+V[i+1][j])/2 -	\
                dt/(2*dy)*((pow(V[i+1][j+1],2)/H[i+1][j+1] +	\
                           g/2*pow(H[i+1][j+1],2)) -	\
                          (pow(V[i+1][j],2)/H[i+1][j] +	\
                           g/2*pow(H[i+1][j],2)));
            }
        }
        //
        // Second half step
        //

        for(int i=1; i<n+1; i++)
        {
            for(int j=1; j<n+1; j++)
            {
                //height
                H[i][j] -= (dt/dx)*(Ux[i][j-1]-Ux[i-1][j-1]) - (dt/dy)*(Vy[i-1][j]-Vy[i-1][j-1]);

                // x momentum
                U[i][j] -= (dt/dx)*((pow(Ux[i][j-1],2)/Hx[i][j-1] + g/2*pow(Hx[i][j-1],2)) -        \
                                    (pow(Ux[i-1][j-1],2)/Hx[i-1][j-1] + g/2*pow(Hx[i-1][j-1],2))) - \
                           (dt/dy)*((Vy[i-1][j] * Uy[i-1][j]/Hy[i-1][j]) -
                                    (Vy[i-1][j-1] * Uy[i-1][j-1]/Hy[i-1][j-1]));

                // y momentum    - score
                V[i][j] -= (dt/dx)*((Ux[i][j-1] * Vx[i][j-1]/Hx[i][j-1]) -                         \
                                    (Ux[i-1][j-1]*Vx[i-1][j-1]/Hx[i-1][j-1])) -                    \
                           (dt/dy)*((pow(Vy[i-1][j],2)/Hy[i-1][j] + g/2*pow(Hy[i-1][j],2)) -       \
                                    (pow(Vy[i-1][j-1],2)/Hy[i-1][j-1] + g/2*pow(Hy[i-1][j-1],2)));

            }
        }
    }

    // res = numpy.add.reduce(numpy.add.reduce(H / n))
    double res = 0.0;
    for(int i=0;i<n+2;i++)
        for(int j=0;j<n+2;j++)
            res+=H[i][j]/n;

    bp.timer_stop();
    bp.print("shallow_water(cpp11_boost)");
}

Cpp11 Omp

// Adapted from: http://people.sc.fsu.edu/~jburkardt/m_src/shallow_water_2d/
// Saved images may be converted into an animated gif with:
// convert   -delay 20   -loop 0   swater*.png   swater.gif
#include <math.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <bp_util.h>

#define WIDTH 5000
#define HEIGHT 5000

int main (int argc, char **argv)
{
    bp_util_type bp = bp_util_create(argc, argv, 3);
    if (bp.args.has_error) {
        return 1;
    }
    const int height        = bp.args.sizes[0];
    const int width         = bp.args.sizes[1];
    const int iterations    = bp.args.sizes[2];

    if (width != WIDTH) {
        fprintf(stderr, "Unsupported size, size must be %dx%d\n", WIDTH, HEIGHT);
        exit(0);
    }
    if (height != WIDTH) {
        fprintf(stderr, "Unsupported size, size must be %dx%d\n", WIDTH, HEIGHT);
        exit(0);
    }

    if (width != height) {
        fprintf(stderr, "Implementation only supports quadratic grid.\n");
        exit(0);
    }

    const double g   = 9.8;       // gravitational constant
    const double dt  = 0.02;      // hardwired timestep
    const double dx  = 1.0;
    const double dy  = 1.0;
    const int droploc = WIDTH/4;

    auto H = new double[WIDTH+2][WIDTH+2];
    auto U = new double[WIDTH+2][WIDTH+2];
    auto V = new double[WIDTH+2][WIDTH+2];

    auto Hx = new double[WIDTH+1][WIDTH+1];
    auto Ux = new double[WIDTH+1][WIDTH+1];
    auto Vx = new double[WIDTH+1][WIDTH+1];
    auto Hy = new double[WIDTH+1][WIDTH+1];
    auto Uy = new double[WIDTH+1][WIDTH+1];
    auto Vy = new double[WIDTH+1][WIDTH+1];

    #pragma omp parallel for collapse(2)
    for(int i=0; i<HEIGHT+2; i++) {
        for(int j=0; j<WIDTH+2; j++) {
          H[i][j]=0.1;
          U[i][j]=0.1;
          V[i][j]=0.1;
        }
    }

    #pragma omp parallel for collapse(2)
    for(int i=0; i<HEIGHT+1; i++) {
        for(int j=0; j<WIDTH+1; j++) {
          Hx[i][j]=0.1;
          Ux[i][j]=0.1;
          Vx[i][j]=0.1;

          Hy[i][j]=0.1;
          Uy[i][j]=0.1;
          Vy[i][j]=0.1;
        }
    }
    H[droploc][droploc] += 5.0;

    bp.timer_start();

    for(int iter=0; iter < iterations; iter++) {

        #pragma omp parallel for        
        for(int i=0; i<WIDTH+2; i++) {  // Reflecting boundary conditions
            H[i][0] = H[i][1];
            U[i][0] = U[i][1];
            V[i][0] = -V[i][1];
            H[i][WIDTH+1] = H[i][WIDTH];
            U[i][WIDTH+1] = U[i][WIDTH];
            V[i][WIDTH+1] = -V[i][WIDTH];
            H[0][i] = H[1][i];
            U[0][i] = -U[1][i];
            V[0][i] = V[1][i];
            H[HEIGHT+1][i] = H[HEIGHT][i];
            U[HEIGHT+1][i] = -U[HEIGHT][i];
            V[HEIGHT+1][i] = V[HEIGHT][i];
        }

        //
        // First half step
        //
        #pragma omp parallel for collapse(2)
        for(int i=0; i<HEIGHT+1; i++) {
            for(int j=0; j<WIDTH; j++) {

                // height
                Hx[i][j] = (H[i+1][j+1]+H[i][j+1])/2 - dt/(2*dx)*(U[i+1][j+1]-U[i][j+1]);

                // x momentum
                Ux[i][j] = (U[i+1][j+1]+U[i][j+1])/2 -          \
                dt/(2*dx)*((pow(U[i+1][j+1],2)/H[i+1][j+1] +	\
                          g/2*pow(H[i+1][j+1],2)) -		\
                         (pow(U[i][j+1],2)/H[i][j+1] +	        \
                          g/2*pow(H[i][j+1],2)));

                // y momentum
                Vx[i][j] = (V[i+1][j+1]+V[i][j+1])/2 -          \
                      dt/(2*dx)*((U[i+1][j+1] *                 \
                                  V[i+1][j+1]/H[i+1][j+1]) -    \
                                 (U[i][j+1] *                   \
                                  V[i][j+1]/H[i][j+1]));
            }
        }

        #pragma omp parallel for collapse(2)
        for(int i=0; i<HEIGHT; i++) {
            for(int j=0; j<WIDTH+1; j++) {
                //height
                Hy[i][j] = (H[i+1][j+1]+H[i+1][j])/2 - dt/(2*dy)*(V[i+1][j+1]-V[i+1][j]);

                //x momentum
                Uy[i][j] = (U[i+1][j+1]+U[i+1][j])/2 -	   \
                dt/(2*dy)*((V[i+1][j+1] *                  \
                           U[i+1][j+1]/H[i+1][j+1]) -	   \
                                (V[i+1][j] *               \
                                 U[i+1][j]/H[i+1][j]));

                //y momentum
                Vy[i][j] = (V[i+1][j+1]+V[i+1][j])/2 -	\
                dt/(2*dy)*((pow(V[i+1][j+1],2)/H[i+1][j+1] +	\
                           g/2*pow(H[i+1][j+1],2)) -	\
                          (pow(V[i+1][j],2)/H[i+1][j] +	\
                           g/2*pow(H[i+1][j],2)));
            }
        }

        //
        // Second half step
        //

        #pragma omp parallel for collapse(2)
        for(int i=1; i<HEIGHT+1; i++) {
            for(int j=1; j<WIDTH+1; j++) {
                //height
                H[i][j] -= (dt/dx)*(Ux[i][j-1]-Ux[i-1][j-1]) - (dt/dy)*(Vy[i-1][j]-Vy[i-1][j-1]);

                // x momentum
                U[i][j] -= (dt/dx)*((pow(Ux[i][j-1],2)/Hx[i][j-1] + g/2*pow(Hx[i][j-1],2)) -        \
                                    (pow(Ux[i-1][j-1],2)/Hx[i-1][j-1] + g/2*pow(Hx[i-1][j-1],2))) - \
                           (dt/dy)*((Vy[i-1][j] * Uy[i-1][j]/Hy[i-1][j]) -
                                    (Vy[i-1][j-1] * Uy[i-1][j-1]/Hy[i-1][j-1]));

                // y momentum    - score
                V[i][j] -= (dt/dx)*((Ux[i][j-1] * Vx[i][j-1]/Hx[i][j-1]) -                         \
                                    (Ux[i-1][j-1]*Vx[i-1][j-1]/Hx[i-1][j-1])) -                    \
                           (dt/dy)*((pow(Vy[i-1][j],2)/Hy[i-1][j] + g/2*pow(Hy[i-1][j],2)) -       \
                                    (pow(Vy[i-1][j-1],2)/Hy[i-1][j-1] + g/2*pow(Hy[i-1][j-1],2)));
            }
        }
    }

    /* NumPy implementation does not do this!?
    // res = numpy.add.reduce(numpy.add.reduce(H / n))
    double res = 0.0;
    #pragma omp parallel for collapse(2) reduction(+:res)
    for(int i=0;i<HEIGHT+2;i++) {
        for(int j=0;j<WIDTH+2;j++) {
            res+=H[i][j]/WIDTH;
        }
    }
    */

    bp.timer_stop();
    bp.print("shallow_water(cpp11_omp)");
}

Cpp11 Seq

// Adapted from: http://people.sc.fsu.edu/~jburkardt/m_src/shallow_water_2d/
// Saved images may be converted into an animated gif with:
// convert   -delay 20   -loop 0   swater*.png   swater.gif
#include <math.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <bp_util.h>

#define WIDTH 5000
#define HEIGHT 5000

int main (int argc, char **argv)
{
    bp_util_type bp = bp_util_create(argc, argv, 3);
    if (bp.args.has_error) {
        return 1;
    }
    const int height    = bp.args.sizes[0];
    const int width     = bp.args.sizes[1];
    const int T         = bp.args.sizes[2];

    if (width != WIDTH) {
        fprintf(stderr, "Unsupported size, size must be %dx%d\n", WIDTH, HEIGHT);
        exit(0);
    }
    if (height != WIDTH) {
        fprintf(stderr, "Unsupported size, size must be %dx%d\n", WIDTH, HEIGHT);
        exit(0);
    }

    if (width != height) {
        fprintf(stderr, "Implementation only supports quadratic grid.\n");
        exit(0);
    }

    const double g   = 9.8;       // gravitational constant
    const double dt  = 0.02;      // hardwired timestep
    const double dx  = 1.0;
    const double dy  = 1.0;
    const int droploc = WIDTH/4;

    auto H = new double[WIDTH+2][WIDTH+2];
    auto U = new double[WIDTH+2][WIDTH+2];
    auto V = new double[WIDTH+2][WIDTH+2];

    auto Hx = new double[WIDTH+1][WIDTH+1];
    auto Ux = new double[WIDTH+1][WIDTH+1];
    auto Vx = new double[WIDTH+1][WIDTH+1];
    auto Hy = new double[WIDTH+1][WIDTH+1];
    auto Uy = new double[WIDTH+1][WIDTH+1];
    auto Vy = new double[WIDTH+1][WIDTH+1];

    for(int i=0; i<HEIGHT+2; i++) {
        for(int j=0; j<WIDTH+2; j++) {
          H[i][j]=0.1;
          U[i][j]=0.1;
          V[i][j]=0.1;
        }
    }


    for(int i=0; i<HEIGHT+1; i++) {
        for(int j=0; j<WIDTH+1; j++) {
          Hx[i][j]=0.1;
          Ux[i][j]=0.1;
          Vx[i][j]=0.1;

          Hy[i][j]=0.1;
          Uy[i][j]=0.1;
          Vy[i][j]=0.1;
        }
    }
    H[droploc][droploc] += 5.0;

    bp.timer_start();

    for(int iter=0; iter < T; iter++) {
        
        for(int i=0; i<WIDTH+2; i++) {  // Reflecting boundary conditions
            H[i][0] = H[i][1];
            U[i][0] = U[i][1];
            V[i][0] = -V[i][1];
            H[i][WIDTH+1] = H[i][WIDTH];
            U[i][WIDTH+1] = U[i][WIDTH];
            V[i][WIDTH+1] = -V[i][WIDTH];
            H[0][i] = H[1][i];
            U[0][i] = -U[1][i];
            V[0][i] = V[1][i];
            H[HEIGHT+1][i] = H[HEIGHT][i];
            U[HEIGHT+1][i] = -U[HEIGHT][i];
            V[HEIGHT+1][i] = V[HEIGHT][i];
        }

        //
        // First half step
        //

        for(int i=0; i<HEIGHT+1; i++) {
            for(int j=0; j<WIDTH; j++) {

                // height
                Hx[i][j] = (H[i+1][j+1]+H[i][j+1])/2 - dt/(2*dx)*(U[i+1][j+1]-U[i][j+1]);

                // x momentum
                Ux[i][j] = (U[i+1][j+1]+U[i][j+1])/2 -          \
                dt/(2*dx)*((pow(U[i+1][j+1],2)/H[i+1][j+1] +	\
                          g/2*pow(H[i+1][j+1],2)) -		\
                         (pow(U[i][j+1],2)/H[i][j+1] +	        \
                          g/2*pow(H[i][j+1],2)));

                // y momentum
                Vx[i][j] = (V[i+1][j+1]+V[i][j+1])/2 -          \
                      dt/(2*dx)*((U[i+1][j+1] *                 \
                                  V[i+1][j+1]/H[i+1][j+1]) -    \
                                 (U[i][j+1] *                   \
                                  V[i][j+1]/H[i][j+1]));
            }
        }

        for(int i=0; i<HEIGHT; i++) {
            for(int j=0; j<WIDTH+1; j++) {
                //height
                Hy[i][j] = (H[i+1][j+1]+H[i+1][j])/2 - dt/(2*dy)*(V[i+1][j+1]-V[i+1][j]);

                //x momentum
                Uy[i][j] = (U[i+1][j+1]+U[i+1][j])/2 -	   \
                dt/(2*dy)*((V[i+1][j+1] *                  \
                           U[i+1][j+1]/H[i+1][j+1]) -	   \
                                (V[i+1][j] *               \
                                 U[i+1][j]/H[i+1][j]));

                //y momentum
                Vy[i][j] = (V[i+1][j+1]+V[i+1][j])/2 -	\
                dt/(2*dy)*((pow(V[i+1][j+1],2)/H[i+1][j+1] +	\
                           g/2*pow(H[i+1][j+1],2)) -	\
                          (pow(V[i+1][j],2)/H[i+1][j] +	\
                           g/2*pow(H[i+1][j],2)));
            }
        }

        //
        // Second half step
        //

        for(int i=1; i<HEIGHT+1; i++) {
            for(int j=1; j<WIDTH+1; j++) {
                //height
                H[i][j] -= (dt/dx)*(Ux[i][j-1]-Ux[i-1][j-1]) - (dt/dy)*(Vy[i-1][j]-Vy[i-1][j-1]);

                // x momentum
                U[i][j] -= (dt/dx)*((pow(Ux[i][j-1],2)/Hx[i][j-1] + g/2*pow(Hx[i][j-1],2)) -        \
                                    (pow(Ux[i-1][j-1],2)/Hx[i-1][j-1] + g/2*pow(Hx[i-1][j-1],2))) - \
                           (dt/dy)*((Vy[i-1][j] * Uy[i-1][j]/Hy[i-1][j]) -
                                    (Vy[i-1][j-1] * Uy[i-1][j-1]/Hy[i-1][j-1]));

                // y momentum    - score
                V[i][j] -= (dt/dx)*((Ux[i][j-1] * Vx[i][j-1]/Hx[i][j-1]) -                         \
                                    (Ux[i-1][j-1]*Vx[i-1][j-1]/Hx[i-1][j-1])) -                    \
                           (dt/dy)*((pow(Vy[i-1][j],2)/Hy[i-1][j] + g/2*pow(Hy[i-1][j],2)) -       \
                                    (pow(Vy[i-1][j-1],2)/Hy[i-1][j-1] + g/2*pow(Hy[i-1][j-1],2)));

            }
        }
    }

    /* NumPy implementation does not do this!?
    // res = numpy.add.reduce(numpy.add.reduce(H / n))
    double res = 0.0;
    for(int i=0;i<HEIGHT+2;i++) {
        for(int j=0;j<WIDTH+2;j++) {
            res+=H[i][j]/WIDTH;
        }
    }
    */

    bp.timer_stop();
    bp.print("shallow_water(cpp11_seq)");
}

Csharp Numcil

#region Copyright
/*
This file is part of Bohrium and copyright (c) 2012 the Bohrium:
team <http://www.bh107.org>.

Bohrium is free software: you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as
published by the Free Software Foundation, either version 3
of the License, or (at your option) any later version.

Bohrium is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the
GNU Lesser General Public License along with Bohrium.

If not, see <http://www.gnu.org/licenses/>.
*/
#endregion

//Adapted from: http://people.sc.fsu.edu/~jburkardt/m_src/shallow_water_2d/

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using R = NumCIL.Range;
using NumCIL;

namespace shallow_water
{
    public static class ShallowWaterSolver
    {
        /// <summary>
        /// NumCIL equivalent to [:]
        /// </summary>
        public static readonly R ALL = R.All;
        /// <summary>
        /// NumCIL equivalent to [1:-1]
        /// </summary>
        public static readonly R INNER = R.Slice(1, -1);
        /// <summary>
        /// NumCIL equivalent to [0]
        /// </summary>
        public static readonly R FIRST = R.El(0);
        /// <summary>
        /// NumCIL equivalent to [1]
        /// </summary>
        public static readonly R SECOND = R.El(1);
        /// <summary>
        /// NumCIL equivalent to [-1]
        /// </summary>
        public static readonly R LAST = R.El(-1);
        /// <summary>
        /// NumCIL equivalent to [:-1]
        /// </summary>
        public static readonly R ZM1 = R.Slice(0, -1);
        /// <summary>
        /// NumCIL equivalent to [1:]
        /// </summary>
        public static readonly R SKIP1 = R.Slice(1, 0);


        public class DataDouble
        {
            public readonly long N;
            public readonly NumCIL.Double.NdArray H;
            public readonly NumCIL.Double.NdArray U;
            public readonly NumCIL.Double.NdArray V;
            public readonly NumCIL.Double.NdArray Hx;
            public readonly NumCIL.Double.NdArray Ux;
            public readonly NumCIL.Double.NdArray Vx;
            public readonly NumCIL.Double.NdArray Hy;
            public readonly NumCIL.Double.NdArray Uy;
            public readonly NumCIL.Double.NdArray Vy;

            public DataDouble(long n)
            {
                N = n;
                H = NumCIL.Double.Generate.Ones(n + 2, n + 2);
                U = NumCIL.Double.Generate.Ones(n + 2, n + 2);
                V = NumCIL.Double.Generate.Ones(n + 2, n + 2);
                Hx = NumCIL.Double.Generate.Ones(n + 1, n + 1);
                   Ux = NumCIL.Double.Generate.Ones(n + 1, n + 1);
                Vx = NumCIL.Double.Generate.Ones(n + 1, n + 1);
                   Hy = NumCIL.Double.Generate.Ones(n + 1, n + 1);
                Uy = NumCIL.Double.Generate.Ones(n + 1, n + 1);
                Vy = NumCIL.Double.Generate.Ones(n + 1, n + 1);
            }
        }


        public static System.Double SolveDouble(long timesteps, DataDouble data)
        {
            //Convenience indices
            R RN = R.El(data.N); //Same as [n]
            R RN1 = R.El(data.N + 1); //Same as [n+1]

            System.Double g = 9.8f;// gravitational constant
            System.Double dt = 0.02f; // hardwired timestep
            System.Double dx = 1.0f;
            System.Double dy = 1.0f;
            long droploc = data.N / 4;

            var H = data.H;
            var U = data.U;
            var V = data.V;
            var Hx = data.Hx;
            var Ux = data.Ux;
            var Vx = data.Vx;
            var Hy = data.Hy;
            var Uy = data.Uy;
            var Vy = data.Vy;

            //Splash!!!
            H[droploc, droploc] += 5.0f;

            for (int i = 0; i < timesteps; i++)
            {
                H.Flush();
                // Reflecting boundary conditions
                H[ALL, FIRST] = H[ALL, SECOND];
                U[ALL, FIRST] = U[ALL, SECOND];
                V[ALL, FIRST] = -V[ALL, SECOND];
                H[ALL, RN1] = H[ALL, RN];
                U[ALL, RN1] = U[ALL, RN];
                V[ALL, RN1] = -V[ALL, RN];
                H[FIRST, ALL] = H[SECOND, ALL];
                U[FIRST, ALL] = -U[SECOND, ALL];
                V[FIRST, ALL] = V[SECOND, ALL];
                H[RN1, ALL] = H[RN, ALL];
                U[RN1, ALL] = -U[RN, ALL];
                V[RN1, ALL] = V[RN, ALL];

                //First half-step

                //Height
                Hx[ALL, R.Slice(0, -1)] = (H[SKIP1,INNER]+H[ZM1,INNER])/2 -
                        dt/(2*dx)*(U[SKIP1,INNER]-U[ZM1,INNER]);

                //x momentum
                Ux[ALL, R.Slice(0, -1)] = (U[SKIP1,INNER]+U[ZM1,INNER])/2 -
                dt/(2*dx)*((U[SKIP1,INNER].Pow(2)/H[SKIP1,INNER] +
                            g/2*H[SKIP1,INNER].Pow(2)) -
                           (U[ZM1,INNER].Pow(2)/H[ZM1,INNER] +
                            g/2*H[ZM1,INNER].Pow(2)));

                // y momentum
                Vx[ALL, ZM1] = (V[SKIP1,INNER]+V[ZM1,INNER])/2 -
                            dt/(2*dx)*((U[SKIP1,INNER] *
                            V[SKIP1,INNER]/H[SKIP1, INNER]) -
                           (U[ZM1,INNER] *
                            V[ZM1,INNER]/H[ZM1,INNER]));



                // height
                Hy[ZM1,ALL] = (H[INNER,SKIP1]+H[INNER,ZM1])/2 -
                            dt/(2*dy)*(V[INNER,SKIP1]-V[INNER,ZM1]);

                // x momentum
                Uy[ZM1,ALL] = (U[INNER,SKIP1]+U[INNER,ZM1])/2 -
                            dt/(2*dy)*((V[INNER,SKIP1] *
                            U[INNER,SKIP1]/H[INNER,SKIP1]) -
                           (V[INNER,ZM1] *
                            U[INNER,ZM1]/H[INNER,ZM1]));
                // y momentum
                Vy[ZM1,ALL] = (V[INNER,SKIP1]+V[INNER,ZM1])/2 -
                            dt/(2*dy)*((V[INNER,SKIP1].Pow(2)/H[INNER,SKIP1] +
                            g/2*H[INNER,SKIP1].Pow(2)) -
                           (V[INNER,ZM1].Pow(2)/H[INNER,ZM1] +
                            g/2*H[INNER,ZM1].Pow(2)));

                // Second half step

                // height
                H[INNER, INNER] = H[INNER, INNER] -
                            (dt/dx)*(Ux[SKIP1,ZM1]-Ux[ZM1,ZM1]) -
                            (dt/dy)*(Vy[ZM1,SKIP1]-Vy[ZM1, ZM1]);

                // x momentum
                U[INNER, INNER] = U[INNER, INNER] -
                           (dt/dx)*((Ux[SKIP1,ZM1].Pow(2)/Hx[SKIP1,ZM1] +
                             g/2*Hx[SKIP1,ZM1].Pow(2)) -
                            (Ux[ZM1,ZM1].Pow(2)/Hx[ZM1,ZM1] +
                             g / 2 * Hx[ZM1, ZM1].Pow(2))) -
                             (dt/dy)*((Vy[ZM1,SKIP1] *
                                       Uy[ZM1,SKIP1]/Hy[ZM1,SKIP1]) -
                                        (Vy[ZM1,ZM1] *
                                         Uy[ZM1,ZM1]/Hy[ZM1,ZM1]));
                // y momentum
                V[R.Slice(1, -1),R.Slice(1, -1)] = V[R.Slice(1, -1),R.Slice(1, -1)] -
                               (dt/dx)*((Ux[SKIP1,ZM1] *
                                         Vx[SKIP1,ZM1]/Hx[SKIP1,ZM1]) -
                                        (Ux[ZM1,ZM1]*Vx[ZM1,ZM1]/Hx[ZM1,ZM1])) -
                                        (dt / dy) * ((Vy[ZM1, SKIP1].Pow(2) / Hy[ZM1, SKIP1] +
                                                  g / 2 * Hy[ZM1, SKIP1].Pow(2)) -
                                                 (Vy[ZM1, ZM1].Pow(2) / Hy[ZM1, ZM1] +
                                                  g / 2 * Hy[ZM1, ZM1].Pow(2)));
            }
            //Make sure we have the actual data and use it as a checksum
            return NumCIL.Double.Add.Reduce(NumCIL.Double.Add.Reduce(H / data.N)).Value[0];
        }
        public class DataFloat
        {
            public readonly long N;
            public readonly NumCIL.Float.NdArray H;
            public readonly NumCIL.Float.NdArray U;
            public readonly NumCIL.Float.NdArray V;
            public readonly NumCIL.Float.NdArray Hx;
            public readonly NumCIL.Float.NdArray Ux;
            public readonly NumCIL.Float.NdArray Vx;
            public readonly NumCIL.Float.NdArray Hy;
            public readonly NumCIL.Float.NdArray Uy;
            public readonly NumCIL.Float.NdArray Vy;

            public DataFloat(long n)
            {
                N = n;
                H = NumCIL.Float.Generate.Ones(n + 2, n + 2);
                U = NumCIL.Float.Generate.Ones(n + 2, n + 2);
                V = NumCIL.Float.Generate.Ones(n + 2, n + 2);
                Hx = NumCIL.Float.Generate.Ones(n + 1, n + 1);
                   Ux = NumCIL.Float.Generate.Ones(n + 1, n + 1);
                Vx = NumCIL.Float.Generate.Ones(n + 1, n + 1);
                   Hy = NumCIL.Float.Generate.Ones(n + 1, n + 1);
                Uy = NumCIL.Float.Generate.Ones(n + 1, n + 1);
                Vy = NumCIL.Float.Generate.Ones(n + 1, n + 1);
            }
        }


        public static System.Single SolveFloat(long timesteps, DataFloat data)
        {
            //Convenience indices
            R RN = R.El(data.N); //Same as [n]
            R RN1 = R.El(data.N + 1); //Same as [n+1]

            System.Single g = 9.8f;// gravitational constant
            System.Single dt = 0.02f; // hardwired timestep
            System.Single dx = 1.0f;
            System.Single dy = 1.0f;
            long droploc = data.N / 4;

            var H = data.H;
            var U = data.U;
            var V = data.V;
            var Hx = data.Hx;
            var Ux = data.Ux;
            var Vx = data.Vx;
            var Hy = data.Hy;
            var Uy = data.Uy;
            var Vy = data.Vy;

            //Splash!!!
            H[droploc, droploc] += 5.0f;

            for (int i = 0; i < timesteps; i++)
            {
                H.Flush();
                // Reflecting boundary conditions
                H[ALL, FIRST] = H[ALL, SECOND];
                U[ALL, FIRST] = U[ALL, SECOND];
                V[ALL, FIRST] = -V[ALL, SECOND];
                H[ALL, RN1] = H[ALL, RN];
                U[ALL, RN1] = U[ALL, RN];
                V[ALL, RN1] = -V[ALL, RN];
                H[FIRST, ALL] = H[SECOND, ALL];
                U[FIRST, ALL] = -U[SECOND, ALL];
                V[FIRST, ALL] = V[SECOND, ALL];
                H[RN1, ALL] = H[RN, ALL];
                U[RN1, ALL] = -U[RN, ALL];
                V[RN1, ALL] = V[RN, ALL];

                //First half-step

                //Height
                Hx[ALL, R.Slice(0, -1)] = (H[SKIP1,INNER]+H[ZM1,INNER])/2 -
                        dt/(2*dx)*(U[SKIP1,INNER]-U[ZM1,INNER]);

                //x momentum
                Ux[ALL, R.Slice(0, -1)] = (U[SKIP1,INNER]+U[ZM1,INNER])/2 -
                dt/(2*dx)*((U[SKIP1,INNER].Pow(2)/H[SKIP1,INNER] +
                            g/2*H[SKIP1,INNER].Pow(2)) -
                           (U[ZM1,INNER].Pow(2)/H[ZM1,INNER] +
                            g/2*H[ZM1,INNER].Pow(2)));

                // y momentum
                Vx[ALL, ZM1] = (V[SKIP1,INNER]+V[ZM1,INNER])/2 -
                            dt/(2*dx)*((U[SKIP1,INNER] *
                            V[SKIP1,INNER]/H[SKIP1, INNER]) -
                           (U[ZM1,INNER] *
                            V[ZM1,INNER]/H[ZM1,INNER]));



                // height
                Hy[ZM1,ALL] = (H[INNER,SKIP1]+H[INNER,ZM1])/2 -
                            dt/(2*dy)*(V[INNER,SKIP1]-V[INNER,ZM1]);

                // x momentum
                Uy[ZM1,ALL] = (U[INNER,SKIP1]+U[INNER,ZM1])/2 -
                            dt/(2*dy)*((V[INNER,SKIP1] *
                            U[INNER,SKIP1]/H[INNER,SKIP1]) -
                           (V[INNER,ZM1] *
                            U[INNER,ZM1]/H[INNER,ZM1]));
                // y momentum
                Vy[ZM1,ALL] = (V[INNER,SKIP1]+V[INNER,ZM1])/2 -
                            dt/(2*dy)*((V[INNER,SKIP1].Pow(2)/H[INNER,SKIP1] +
                            g/2*H[INNER,SKIP1].Pow(2)) -
                           (V[INNER,ZM1].Pow(2)/H[INNER,ZM1] +
                            g/2*H[INNER,ZM1].Pow(2)));

                // Second half step

                // height
                H[INNER, INNER] = H[INNER, INNER] -
                            (dt/dx)*(Ux[SKIP1,ZM1]-Ux[ZM1,ZM1]) -
                            (dt/dy)*(Vy[ZM1,SKIP1]-Vy[ZM1, ZM1]);

                // x momentum
                U[INNER, INNER] = U[INNER, INNER] -
                           (dt/dx)*((Ux[SKIP1,ZM1].Pow(2)/Hx[SKIP1,ZM1] +
                             g/2*Hx[SKIP1,ZM1].Pow(2)) -
                            (Ux[ZM1,ZM1].Pow(2)/Hx[ZM1,ZM1] +
                             g / 2 * Hx[ZM1, ZM1].Pow(2))) -
                             (dt/dy)*((Vy[ZM1,SKIP1] *
                                       Uy[ZM1,SKIP1]/Hy[ZM1,SKIP1]) -
                                        (Vy[ZM1,ZM1] *
                                         Uy[ZM1,ZM1]/Hy[ZM1,ZM1]));
                // y momentum
                V[R.Slice(1, -1),R.Slice(1, -1)] = V[R.Slice(1, -1),R.Slice(1, -1)] -
                               (dt/dx)*((Ux[SKIP1,ZM1] *
                                         Vx[SKIP1,ZM1]/Hx[SKIP1,ZM1]) -
                                        (Ux[ZM1,ZM1]*Vx[ZM1,ZM1]/Hx[ZM1,ZM1])) -
                                        (dt / dy) * ((Vy[ZM1, SKIP1].Pow(2) / Hy[ZM1, SKIP1] +
                                                  g / 2 * Hy[ZM1, SKIP1].Pow(2)) -
                                                 (Vy[ZM1, ZM1].Pow(2) / Hy[ZM1, ZM1] +
                                                  g / 2 * Hy[ZM1, ZM1].Pow(2)));
            }
            //Make sure we have the actual data and use it as a checksum
            return NumCIL.Float.Add.Reduce(NumCIL.Float.Add.Reduce(H / data.N)).Value[0];
        }
    }
}

Snakes And Ladders

Python Numpy

Error

There are issues with the implementation.

Implementation seems broken when running with Bohrium, it looks like it does nothing but copy data back and forth between NumPy and Bohrium.

# By  Natalino Busa <https://gist.github.com/natalinobusa/4633275>
from __future__ import print_function
from benchpress.benchmarks import util
import numpy as np

bench = util.Benchmark("Snakes and Ladders", "size*iterations")


def nullgame(size, dtype):
    p = np.zeros((size + 1, size + 1), dtype=dtype)

    for i in range(size + 1):
        for j in range(6):
            if i + j < size:
                p[i][i + j + 1] = 1.0 / 6.0

    p[size][size] = 1
    p[size - 1][size] = 6.0 / 6.0
    p[size - 2][size] = 5.0 / 6.0
    p[size - 3][size] = 4.0 / 6.0
    p[size - 4][size] = 3.0 / 6.0
    p[size - 5][size] = 2.0 / 6.0
    p[size - 6][size] = 1.0 / 6.0
    return p


def main():
    size, iterations = bench.args.size

    if bench.args.visualize:
        from matplotlib import pyplot

    a = np.array(np.zeros(size + 1, dtype=bench.dtype))
    p = np.array(nullgame(size, dtype=bench.dtype))
    m = p  # Initial matrix is p
    pr_end = np.array(np.zeros(iterations, dtype=bench.dtype))

    bench.start()
    for k in range(iterations):
        if bench.args.visualize:
            # Plot the probability distribution at the k-th iteration
            pyplot.figure(1)
            pyplot.plot(m[0][0:size])

        # Store the probability of ending after the k-th iteration
        pr_end[k] = m[0][size]

        # Store/plot the accumulated marginal probability at the k-th iteration
        a = a + m[0]

        bench.flush()
        if bench.args.visualize:
            pyplot.figure(2)
            pyplot.plot(a[0:size])

        # calculate the stocastic matrix for iteration k+1
        if util.bh_is_loaded_as_np and bench.args.no_extmethods:
            m = np.array(np.dot(m.copy2numpy(), p.copy2numpy()))
        else:
            m = np.dot(m, p)
    bench.stop()
    bench.pprint()

    # plot the probability of ending the game
    # after k iterations
    if bench.args.visualize:
        pyplot.figure(3)
        pyplot.plot(pr_end[0:iterations - 1])

        # show the three graphs
        pyplot.show()


if __name__ == "__main__":
    main()

Wisp

Python Numpy

# -*- coding: utf-8 -*-
"""
Created on Thu Nov 20 10:47:25 2014
GPL

@author: Rasmus Nordfang
"""

from benchpress.benchmarks import util
import numpy as np
from salt import salt
from temp import temp
from density_imr import dens
from heat_imr import heatcap

bench = util.Benchmark("Wisp", "length*width*steps")

length = 1  # size of the grid unit m
width = 1  # size of the grid unit m

N_L = bench.args.size[0]  # number of grids in Length direction
N_W = bench.args.size[1]  # number of grids in width direction

n_years = 10  # number of years
# parameters

# Generel Start Values
# temp,[C]     density[kg/m^3],  depth,[m]      mass[kg]        Salinity[g/kg]
T_ML = -4;
p_ML = 10;
h_ML = 50;
m_ML = 3000;
S_ML = 33.0  # Mixed Layer
T_PC = 0.1;
p_PC = 20;
h_PC = 350;
m_PC = 2000;
S_PC = 34.5  # PycoCline
T_DP = 5.0;
p_DP = 30;
h_DP = 800;
m_DP = 1000;
S_DP = 32.0  # DeeP atlantic layer
T_AB = 2.0;
p_AB = 40;
h_AB = 3000;
m_AB = 8000;
S_AB = 35.0  # AByssal Layer

T_start = np.array([T_ML, T_PC, T_DP, T_AB])  # temperature
p_start = np.array([p_ML, p_PC, p_DP, p_AB])  # density
m_start = np.array([m_ML, m_PC, m_DP, m_AB])  # mass of layer
S_start = np.array([S_ML, S_PC, S_DP, S_AB])  # Salinity of layer
h = np.array([h_ML, h_PC, h_DP, h_AB])  # depth

hight = np.array([50, 300, 400, 1800])  # [m]

Volume = np.array(length * width * hight)  # [m^3]

pres = np.array([60.4, 360, 814, 3030])  # [kg/m^3] general pressure

# time
year = 60 * 60 * 24 * 365  # [s] seconds a year
dt = year / 200  # [s] Generel time stepsize
dt_top = year / 20  # [s] time stepsize for top layers

dt_temp = dt_top / dt  # fraction between generel stepsize and top stepsize
time = n_years * year  # [s] total time
steps = int(time / dt)  # [] number of timesteps

steps = bench.args.size[2]

# define a total matrix for all timesteps. With big calculations (I should maybe save them differently)
T = np.ones(shape=(steps + 1, N_L, N_W, 4));
T[0, :, :, :] = T_start
p = np.ones(shape=(steps + 1, N_L, N_W, 4));
m = np.ones(shape=(steps + 1, N_L, N_W, 4));
m[0, :, :, :] = m_start
S = np.ones(shape=(steps + 1, N_L, N_W, 4));
S[0, :, :, :] = S_start
c_w = np.ones(shape=(steps + 1, N_L, N_W, 4));  # [J / (K * kg)] water specefic heat

V = np.zeros(shape=(steps + 1, N_L, N_W));
V[0, :, :] = 2.0  # [m^3] Ice volume

h_ice = np.ones(shape=(steps + 1, N_L, N_W));  # [m] ice height
x_ice = np.ones(shape=(steps + 1, N_L, N_W));  # percentage ice cover

I_export = 0  # [m^3] volume of ice exportet away

bench.start()
for i in range(steps):  # Forward euler
    h_ice[i, :, :] = V[i, :, :]  # if V > 0.5 -> h_ice = V and h_ice = 1
    putmask = V[i, :, :] <= 0.5

    h_ice[i, :, :] = ~putmask * h_ice[i, :, :] + putmask * (V[i, :, :] ** 0.5 / 2.0)
    x_ice[i, :, :] = ~putmask * x_ice[i, :, :] + putmask * (2 * h_ice[i, :, :])

    if i % dt_temp == 0:
        p[i, :, :, :] = dens(S[i, :, :, :], T[i, :, :, :], pres[:])  # [kg/m]   (3d [kg/m^3])
        c_w[i, :, :, :] = heatcap(S[i, :, :, :], T[i, :, :, :], pres[:])  # [J/(kg*C)]
        m[i, :, :, :] = hight[:] * p[i, :, :, :]  # [kg]
        [dT, dV] = temp(T[i, :, :, :], p[i, :, :, :], h, m[i, :, :, :], c_w[i, :, :, :], x_ice[i, :, :],
                        0 if i < (0.5 * year / dt) else 1, h_ice[i, :, :], I_export)

        T[i + 1:i + 1 + dt_temp, :, :, :] = T[i, :, :, :] + dt * dT[:, :, :]
        V[i + 1:i + 1 + dt_temp, :, :] = V[i, :, :] + dt * dV[:, :]
        S[i + 1:i + 1 + dt_temp, :, :, :] = salt(S[i, :, :, :], p[i, :, :, :], h, x_ice[i, :, :], I_export,
                                                 dV / dt) * dt + S[i, :, :, :]

    else:
        p[i, :, :, 0] = dens(S[i, :, :, 0], T[i, :, :, 0], pres[0])
        p[i, :, :, 1] = dens(S[i, :, :, 1], T[i, :, :, 1], pres[1])
        c_w[i, :, :, 0] = heatcap(S[i, :, :, 0], T[i, :, :, 0], pres[0])  # [J/(kg*C)]
        c_w[i, :, :, 1] = heatcap(S[i, :, :, 1], T[i, :, :, 1], pres[1])  # [J/(kg*C)]

        m[i, :, :, 0:2] = hight[0:2] * p[i, :, :, 0:2]  # [kg]

        # all layers
        [dT, dV] = temp(T[i, :, :, :], p[i, :, :, :], h, m[i, :, :, :], c_w[i, :, :, :], x_ice[i, :, :],
                        0 if i < (0.5 * year / dt) else 1, h_ice[i, :, :], I_export)
        T[i + 1, :, :, 0] = dT[:, :, 0] * dt + T[i, :, :, 0]
        V[i + 1, :, :] = dV[:, :] * dt + V[i, :, :]

        dS = salt(S[i, :, :, :], p[i, :, :, :], h, x_ice[i, :, :], I_export, dV / dt)
        S[i + 1, :, :, 0] = dS[:, :, 0] * dt + S[i, :, :, 0]

    I_export = 14.22

bench.stop()
bench.pprint()

Xraysim

Python Numpy

#!/usr/bin/python
# -*- coding: utf-8 -*-

"""
Created on Mon Aug 31 14:14:15 2015

@author: Mads Thoudahl

"""
from xraysimphysics import emptyAAscene, addobjtoscene
from xraysimgeometry import coordsAAscene, raygeometry, detectorgeometry, runAABB
import numpy as np
from material import Material
from scene_objects import snake
from benchpress.benchmarks import util

bench = util.Benchmark("X-ray sim", "<screen-res>*<detector-res>*<iterations>")


class Const:
    EPS = 1e-5
    invsqr = 1.0 / (np.pi * 4)


def buildscene(
        scenedefs,  # scenedefinitions
        objlist,  # list of objects, obj=
        verbose=False  # printing status updates along the way
):
    """ Builds and returns a (sparse) scenegrid, and a (dense) voxel array,
        describing what materials are in which voxels

        args:
            scenedefs   in the form [x0,y0,z0,x1,y1,z1,xr,yr,zr], where
                        *0, and *1 is corners spanning axis aligned scene.
                        *r is resolution in the three dimensions.

            objectlist  list of objects in the form
                        obj = [objref, shape, material, referencesystem, specific geometric describers]

            verbose     directions to verboseprint std is False

        returns:
            scenegrid       meshgrid [np[x0..x1], np[y0..y1], np[z0..z1]]
                            list of arrays of shapes [(xr+1), (yr+1), (zr+1)]
            scenematarials  3D array of shape (xr,yr,zr) describing which
                            material inhibits which voxel in the scene """
    # create empty scene
    scenegrid = coordsAAscene(scenedefs)
    scenematerials = emptyAAscene(scenedefs)
    if verbose: print("axisaligned scene generated")

    # habitate scene with objects
    added, discarded = [], []
    for obj in objlist:
        if addobjtoscene(scenedefs, scenematerials, obj):
            added.append(obj[0])
            if verbose: print("object {} included".format(obj[0]))
        else:
            discarded.append(obj[0])
            if verbose: print("object {} NOT included".format(obj[0]))

    if verbose:
        print ("axisaligned scene inhabited with {} objects, {} discarded".format(len(added), len(discarded)))
        print ("objects added: {}".format(added))
        print ("objects discarded: {}".format(discarded))

    return scenegrid, scenematerials


def xraysim(sourcelist,
            detectordeflist,
            scenegrid,
            scenematerials,
            materials,
            scene_resolution,
            detector_resolution):
    """ performs the calculations figuring out what is detected
        INPUT:
        sourcelist: list of np.array([
                        px,py,pz,       position
                        relative_power, relative to other sources simulated
                        energy])        MeV

        detectorlist: list of np.array([
                        px0,py0,pz0,    position lower left corner
                        px1,py1,pz1,    position upper right corner
                        res1,res2 ])    resolution of detector

        scenegrid:      a numpy array 'meshgrid' shaped (xs+1,ys+1,zs+1)
                        containing absolute coordinates of grid at 'intersections'
                        as returned by buildscene

        scenematerials: a numpy array shaped (xs,ys,zs)
                        containing information of which MATERIAL inhibits
                        any voxel, as an integer value.
                        ## for a better model this should also contain
                        ## information of how much of the voxel is filled..

        materials:     a dict containing all materials used in the scenematerials
                       as Material objects

        scene_resolution:    resolution of the scene cubic, e.g. 32 equals 32^3

        detector_resolution: resolution of the detectors squared, e.g. 22 equals 22^2
    """
    ## indexing constants
    power = 3

    # generate an array of endpoints for rays (at the detector)
    detectors = []
    for ddef in detectordeflist:
        detectors.append(detectorgeometry(ddef))

    for source in sourcelist:
        # unpack source
        sx, sy, sz, spower, senergy = source
        rayorigin = sx, sy, sz

        # preprocess the scene physics
        # building a map of attenuation coefficients
        sceneattenuates = np.zeros(scenematerials.shape)

        for material_id in materials.keys():
            sceneattenuates += (scenematerials == material_id) \
                               * materials[material_id].getMu(senergy)

        ret = []
        for pixelpositions, pixelareavector, dshape, result in detectors:
            # do geometry
            rayudirs, raylengths, rayinverse = raygeometry(rayorigin, pixelpositions)
            raydst = runAABB(scenegrid, rayudirs, rayorigin, rayinverse)

            # raydst is now to be correlated with material/attenuation grid
            t = sceneattenuates[..., np.newaxis] * raydst
            # We sums the three dimensions
            t = np.sum(t, axis=(0, 1, 2))
            dtectattenuates = t.reshape(detector_resolution, detector_resolution)
            pixelintensity = ((Const.invsqr * source[power] * np.ones(raylengths.shape[0])[
                ..., np.newaxis]) / raylengths).reshape(dshape)
            area = np.dot(rayudirs, pixelareavector.reshape(3, 1)).reshape(dshape)
            result += pixelintensity * area * np.exp(-dtectattenuates)
            ret.append(result)
            if bench.args.visualize:
                low = np.minimum.reduce(result.flatten())
                high = np.maximum.reduce(result.flatten())
                if util.Benchmark().bohrium:
                    low = low.copy2numpy()
                    high = high.copy2numpy()
                util.plot_surface(result, "2d", 0, low - 0.001 * low, high - 0.5 * high)
                util.plot_surface(result, "2d", 0, low - 0.001 * low, high - 0.5 * high)

    # We return only the result of the detectors
    return ret


def setup(scene_res, detector_res):
    """Returns a scene to xray

       scene_res:    resolution of the scene cubic, e.g. 32 equals 32^3
       detector_res: resolution of the detectors squared, e.g. 22 equals 22^2
    """

    srclist, detectorlist, scenedefs, objectlist = snake(scene_res, detector_res)

    # build a model of all materials
    materials = Material.initAll()

    # build scene
    scenegrid, scenematerials = buildscene(scenedefs, objectlist)

    return (srclist, detectorlist, scenegrid, scenematerials, materials, scene_res, detector_res)


def main():
    scene_res = bench.args.size[0]
    detector_res = bench.args.size[1]
    iterations = bench.args.size[2]
    scene = setup(scene_res, detector_res)

    bench.start()
    for _ in range(iterations):
        detector_results = xraysim(*scene)
        bench.flush()
    bench.stop()
    bench.pprint()
    bench.confirm_exit()


if __name__ == '__main__':
    main()