Getting started

First, install OpenCilk. Then, download the tutorial code examples and enter the cloned directory:

$ git clone https://github.com/OpenCilk/tutorial
$ cd tutorial

Let us walk through the steps of building, running, and testing a program with OpenCilk.

Note: The rest of this guide assumes that OpenCilk is installed within /opt/opencilk/ and that clang points to the OpenCilk C compiler at /opt/opencilk/bin/clang.

Using the compiler

To compile a Cilk program with OpenCilk, pass the -fopencilk flag to Clang (or Clang++). For example:

$ clang -fopencilk -O3 fib.c -o fib

Note: Pass the -fopencilk flag to the compiler both when compiling and when linking the Cilk program. During compilation, the flag ensures that the Cilk keywords are recognized and compiled. During linking, the flag ensures that the program is properly linked with the OpenCilk runtime library. (Former users of Intel Cilk Plus with GCC: make sure you do not include the -lcilkrts flag when linking.)

The OpenCilk compiler is based on a recent stable version of the LLVM clang compiler. It supports all compiler flags and features that LLVM clang supports, including optimization-level flags, debug-information flags, and target-dependent compilation options. See the Clang documentation for more information on the command-line arguments.
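For orientation, here is a minimal sketch of what a small Cilk program such as fib.c can look like. It is not the tutorial's fib.c. The USE_OPENCILK macro is a hypothetical name introduced only for this sketch: define it yourself (for example, -DUSE_OPENCILK together with -fopencilk) to enable the Cilk keywords, or leave it undefined to build the same code as plain serial C.

```c
#include <assert.h>
#include <stdint.h>

/* USE_OPENCILK is a hypothetical macro for this sketch, not an OpenCilk
 * feature. Without it, the Cilk keywords expand to nothing and the code
 * runs serially. */
#ifdef USE_OPENCILK
#include <cilk/cilk.h>
#else
#define cilk_spawn
#define cilk_sync
#endif

/* Recursive Fibonacci: the first recursive call may run in parallel
 * with the second one. */
int64_t fib(int n) {
  assert(n >= 0);
  if (n < 2)
    return n;
  int64_t x, y;
  x = cilk_spawn fib(n - 1); /* child may run in parallel with the caller */
  y = fib(n - 2);            /* the caller continues with the other half */
  cilk_sync;                 /* wait for the spawned child before using x */
  return x + y;
}
```

Because x is written by the spawned call, it is only safe to read after cilk_sync.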

macOS

On macOS, clang needs standard system libraries and headers, which are provided by Xcode or the Xcode Command Line Tools. To run the OpenCilk compiler with those libraries and headers, invoke the clang binary with xcrun. For example:

$ xcrun clang -fopencilk -O3 fib.c -o fib

Running the program on multiple cores

The program will automatically execute in parallel, using all available cores.

$ ./fib 35
fib(35) = 9227465

To explicitly set the number of parallel Cilk workers for a program execution, set the CILK_NWORKERS environment variable. For example:

$ CILK_NWORKERS=2 ./fib 35
fib(35) = 9227465
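If a program wants to log the worker count that was requested for a run, it can read the same environment variable itself. The sketch below is only an illustration under that assumption: the OpenCilk runtime does its own handling of CILK_NWORKERS, and the helper names parse_worker_count and requested_workers are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parse a worker-count string.
 * Returns 0 when the text is missing or is not a positive integer. */
long parse_worker_count(const char *text) {
  if (text == NULL || *text == '\0')
    return 0;
  char *end = NULL;
  errno = 0;
  long value = strtol(text, &end, 10);
  assert(end != NULL); /* strtol always sets end when given a valid pointer */
  if (errno != 0 || *end != '\0' || value <= 0)
    return 0;
  return value;
}

/* Hypothetical helper: the worker count requested through CILK_NWORKERS,
 * or 0 when the variable is unset or invalid. */
long requested_workers(void) {
  return parse_worker_count(getenv("CILK_NWORKERS"));
}
```

A return value of 0 here simply means that no explicit count was requested, in which case the program uses all available cores as described above.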

Using Cilksan

Use the OpenCilk Cilksan race detector to verify that your parallel Cilk program is deterministic. Cilksan instruments a program to detect determinacy race bugs at runtime. It is guaranteed to find any and all determinacy races that arise in a given program execution. If there are no races, Cilksan will report that the execution was race-free.

To check for determinacy races with Cilksan, add the -fsanitize=cilk flag during compilation and linking. We also recommend the -Og -g flags for debugging:

$ clang -fopencilk -fsanitize=cilk -Og -g nqueens.c -o nqueens

The nqueens.c code in this example contains a subtle determinacy race bug. Running the Cilksan-instrumented nqueens program produces the following output, which shows two parallel strands attempting to read from and write to the same memory address (through variables a and b, respectively).

$ ./nqueens 12
Running Cilksan race detector.
Running ./nqueens with n = 12.
Race detected on location 7f515c3f34f6
*     Read 4994b3 nqueens /home/user/opencilk/tutorial/nqueens.c:62:3
|        `-to variable a (declared at /home/user/opencilk/tutorial/nqueens.c:48)
+     Call 499da5 nqueens /home/user/opencilk/tutorial/nqueens.c:67:29
+    Spawn 4995b3 nqueens /home/user/opencilk/tutorial/nqueens.c:67:29
|*   Write 499586 nqueens /home/user/opencilk/tutorial/nqueens.c:65:10
||       `-to variable b (declared at /home/user/opencilk/tutorial/nqueens.c:50)
\| Common calling context
 +    Call 499da5 nqueens /home/user/opencilk/tutorial/nqueens.c:67:29
 +   Spawn 4995b3 nqueens /home/user/opencilk/tutorial/nqueens.c:67:29
[...output truncated...]
   Allocation context
    Stack object b (declared at /home/user/opencilk/tutorial/nqueens.c:50)
     Alloc 499493 in nqueens /home/user/opencilk/tutorial/nqueens.c:61:16
      Call 499da5 nqueens /home/user/opencilk/tutorial/nqueens.c:67:29
     Spawn 4995b3 nqueens /home/user/opencilk/tutorial/nqueens.c:67:29
[...output truncated...]

1.137000
Total number of solutions : 14200

Cilksan detected 1 distinct races.
Cilksan suppressed 781409 duplicate race reports.

Programs instrumented with Cilksan are always run serially, regardless of the number of processors that are available or specified. The instrumented program is expected to run up to several times slower than its non-instrumented serial counterpart.
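A determinacy race occurs when two logically parallel strands access the same memory location and at least one of them writes to it. The following sketch is not taken from nqueens.c; it is a smaller, hypothetical example of the same kind of bug, together with one way to remove it. As an assumption of this sketch, the hypothetical USE_OPENCILK macro selects between the Cilk keywords and a plain serial build.

```c
#include <assert.h>

/* USE_OPENCILK is a hypothetical macro for this sketch. */
#ifdef USE_OPENCILK
#include <cilk/cilk.h>
#else
#define cilk_spawn
#define cilk_sync
#endif

/* Read-modify-write on a location shared by all spawned calls. */
static void add_racy(long *total, long value) { *total += value; }

/* Racy: the spawned calls run logically in parallel and all update total. */
long sum_racy(const long *data, int n) {
  long total = 0;
  for (int i = 0; i < n; ++i)
    cilk_spawn add_racy(&total, data[i]);
  cilk_sync;
  return total;
}

/* Race-free: each half writes only its own local result, and the two
 * results are combined after the sync. */
long sum_fixed(const long *data, int n) {
  assert(n >= 0);
  if (n == 0)
    return 0;
  if (n == 1)
    return data[0];
  int half = n / 2;
  long left, right;
  left = cilk_spawn sum_fixed(data, half);
  right = sum_fixed(data + half, n - half);
  cilk_sync;
  return left + right;
}
```

When built with -fopencilk -fsanitize=cilk, a call to sum_racy would be expected to produce a race report on total, while sum_fixed should run race-free.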

Note: On macOS, the compiled nqueens.c binary uses builtins that Cilksan does not currently recognize. To work around this behavior, add the flag -D_FORTIFY_SOURCE=0 when compiling:

$ clang -fopencilk -fsanitize=cilk -Og -g -D_FORTIFY_SOURCE=0 nqueens.c -o nqueens

Using Cilkscale

Use the OpenCilk Cilkscale scalability analyzer and benchmarking script to measure the work, span, and parallelism of your Cilk program, and to benchmark parallel speedup on different numbers of cores.

To measure work and span with Cilkscale, add the -fcilktool=cilkscale flag during compilation and linking:

$ clang -fopencilk -fcilktool=cilkscale -O3 qsort.c -o qsort

Running the Cilkscale-instrumented program will output work, span, and parallelism measurements in CSV format at the end of the execution. For example:

$ ./qsort 10000000
Sorting 10000000 integers
All sorts succeeded
tag,work (seconds),span (seconds),parallelism,burdened_span (seconds),burdened_parallelism
,14.511,0.191245,75.8764,0.191514,75.7699
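The columns in this output are related by simple arithmetic: parallelism is work divided by span, and the numbers above suggest that burdened parallelism is likewise work divided by burdened span. The short sketch below reproduces that calculation; the function name parallelism is chosen here for illustration and is not part of the Cilkscale API.

```c
#include <assert.h>

/* Parallelism is the ratio of work to span. Both arguments are in
 * seconds, as in the Cilkscale CSV columns. */
double parallelism(double work_seconds, double span_seconds) {
  assert(span_seconds > 0.0);
  return work_seconds / span_seconds;
}
```

With the row above, parallelism(14.511, 0.191245) is about 75.88 and parallelism(14.511, 0.191514) is about 75.77, in line with the parallelism and burdened_parallelism columns.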

To output the Cilkscale measurements to a file, set the CILKSCALE_OUT environment variable:

$ CILKSCALE_OUT=qsort_workspan.csv ./qsort 10000000
Sorting 10000000 integers
All sorts succeeded
$ cat qsort_workspan.csv
tag,work (seconds),span (seconds),parallelism,burdened_span (seconds),burdened_parallelism
,13.9352,0.177858,78.35,0.178218,78.1917

Analyzing a region

By default, Cilkscale will only analyze whole-program execution. To analyze specific regions of a program, annotate the code accordingly using the Cilkscale API.

For example, to measure the work and span of the core function sample_qsort in the tutorial qsort.c code:

#include <cilk/cilkscale.h>

int main(int argc, char **argv) {
  // …
  wsp_t start, end;
  start = wsp_getworkspan();
  sample_qsort(a, a + n); /* <-- analyze this */
  end = wsp_getworkspan(); 
  // …
  wsp_dump(wsp_sub(end, start), "sample_qsort");
  // …
}

Then, recompile with Cilkscale and rerun:

$ clang -fopencilk -fcilktool=cilkscale -O3 qsort.c -o qsort
$ ./qsort 10000000
Sorting 10000000 integers
All sorts succeeded
tag,work (seconds),span (seconds),parallelism,burdened_span (seconds),burdened_parallelism
sample_qsort,14.3595,0.13184,108.916,0.132084,108.715
,14.412,0.184341,78.181,0.184585,78.0777

Every analyzed region appears as a separate row in the Cilkscale CSV output, tagged with the string that was passed in the corresponding call to wsp_dump().

Scalability benchmarking and visualization

Cilkscale can also be used to benchmark and plot the execution time of your program (and each analyzed region) on different numbers of processors.

First, build your program twice,

  • once with -fcilktool=cilkscale, and
  • once with -fcilktool=cilkscale-benchmark:

$ clang -fopencilk -fcilktool=cilkscale -O3 qsort.c -o qsort
$ clang -fopencilk -fcilktool=cilkscale-benchmark -O3 qsort.c -o qsort-bench

Then, run the program with the Cilkscale benchmarking and visualizer Python script, which is found at share/Cilkscale_vis/cilkscale.py within the OpenCilk installation directory:

$ python3 /opt/opencilk/share/Cilkscale_vis/cilkscale.py \
    -c qsort -b qsort-bench --args 10000000

This will first measure work, span, and parallelism; then run the program with 1, 2, ..., P Cilk workers (where P is the number of available physical cores) and time each execution; and finally output the results as a CSV table (out.csv) and as plots in a PDF document (plot.pdf):

$ python3 /opt/opencilk/share/Cilkscale_vis/cilkscale.py -c qsort -b qsort-bench --args 10000000
Namespace(args=['10000000'], cilkscale='./qsort', cilkscale_benchmark='./qsort_bench',
cpu_counts=None, output_csv='out.csv', output_plot='plot.pdf', rows_to_plot='all')

>> STDOUT (./qsort 10000000)
Sorting 10000000 integers
All sorts succeeded
<< END STDOUT

>> STDERR (./qsort 10000000)
<< END STDERR

INFO:runner:Generating scalability data for 8 cpus.
INFO:runner:CILK_NWORKERS=1 taskset -c 0 ./qsort_bench 10000000
INFO:runner:CILK_NWORKERS=2 taskset -c 0,2 ./qsort_bench 10000000
INFO:runner:CILK_NWORKERS=3 taskset -c 0,2,4 ./qsort_bench 10000000
INFO:runner:CILK_NWORKERS=4 taskset -c 0,2,4,6 ./qsort_bench 10000000
INFO:runner:CILK_NWORKERS=5 taskset -c 0,2,4,6,8 ./qsort_bench 10000000
INFO:runner:CILK_NWORKERS=6 taskset -c 0,2,4,6,8,10 ./qsort_bench 10000000
INFO:runner:CILK_NWORKERS=7 taskset -c 0,2,4,6,8,10,12 ./qsort_bench 10000000
INFO:runner:CILK_NWORKERS=8 taskset -c 0,2,4,6,8,10,12,14 ./qsort_bench 10000000
INFO:plotter:Generating plot
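When reading the benchmark results, it helps to compare the measured speedup against what the work and span measurements allow. By the work and span laws, the speedup on P workers can exceed neither P nor the parallelism (work divided by span). The sketch below computes that bound; the function name speedup_upper_bound is hypothetical and is not part of the cilkscale.py script.

```c
#include <assert.h>

/* Upper bound on speedup with the given number of workers:
 * the smaller of the worker count and the parallelism (work / span). */
double speedup_upper_bound(double work, double span, int workers) {
  assert(span > 0.0 && workers >= 1);
  double par = work / span;
  return par < (double)workers ? par : (double)workers;
}
```

For the whole-program measurements shown earlier (work 14.511 s, span 0.191245 s), the bound on 8 workers is 8, so the span is not the limiting factor at that core count; it would only become one at roughly 76 workers or more.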

To see all options of the Cilkscale cilkscale.py script, pass it the --help argument:

$ python3 /opt/opencilk/share/Cilkscale_vis/cilkscale.py --help