Glossary

Centralized

A good scheduler operates in a distributed fashion, where the threads implementing the scheduler cooperate to load-balance the computation. Provably good online, distributed schedulers exist, but analyzing them is complicated. To keep the analysis simple, one may consider an online centralized scheduler that knows the global state of the computation at any moment.

Child

A child process is one that results from a spawn. It may run in parallel with the procedure that executed the spawn—its parent.

Coarsen

To reduce the overhead of recursive spawning, task-parallel platforms sometimes coarsen the leaves of the recursion by executing several iterations in a single leaf, either automatically or under programmer control. This optimization comes at the expense of reducing the parallelism. If the computation has sufficient parallel slackness, however, near-perfect linear speedup won’t be sacrificed.
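
A minimal sketch of manual coarsening (illustrative code, not from the text): the recursion stops at chunks of GRAIN iterations, and a single leaf executes such a chunk serially instead of spawning one task per element.

```c
#include <cilk/cilk.h>

#define GRAIN 2048  /* illustrative grain size, not a tuned value */

/* Add 1.0 to a[lo..hi): recurse in parallel on the two halves,
   but run any chunk of at most GRAIN iterations serially in one leaf. */
void inc_range(double *a, long lo, long hi) {
    if (hi - lo <= GRAIN) {
        for (long i = lo; i < hi; ++i)   /* coarsened serial leaf */
            a[i] += 1.0;
    } else {
        long mid = lo + (hi - lo) / 2;
        cilk_spawn inc_range(a, lo, mid);
        inc_range(a, mid, hi);
        cilk_sync;
    }
}
```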

Complete step

As a greedy scheduler manages a computation on $P$ processors, each step is classified as complete or incomplete. In a complete step, at least $P$ strands are ready to execute, meaning that all strands on which they depend have finished execution. A greedy scheduler assigns any $P$ of the ready strands to the processors, completely utilizing all the processor resources.

Computation dag

See parallel trace.

Concurrency

With some parallel-programming models, a programmer cannot accomplish anything significant without dealing with concurrency, where the programmer must manage the interactions between otherwise independent subcomputations.

Critical path

The critical path of a task-parallel trace is the longest path, weighted by execution time. The weight of the critical path is called the span of the trace.

Determinacy race

A determinacy race occurs when two logically parallel instructions access the same memory location and at least one of the instructions modifies the value stored in the location. For a computation with a determinacy race, the results can vary depending on how the instructions are scheduled on the multicore computer. Often in practice, most instruction orderings produce correct results, but some orderings generate improper results when the instructions interleave. Consequently, races can be extremely hard to test for. Task-parallel programming environments often provide race-detection productivity tools to help you isolate race bugs.
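
For illustration (not from the text), here is a minimal Cilk program with a determinacy race: the spawned call to add1 may run in parallel with the second call, and both read and modify the shared variable x.

```c
#include <cilk/cilk.h>
#include <stdio.h>

int x = 0;                      /* shared memory location */

void add1(void) { x = x + 1; }  /* read-modify-write of x */

int main(void) {
    cilk_spawn add1();          /* may run in parallel with... */
    add1();                     /* ...this call: a race on x */
    cilk_sync;
    printf("x = %d\n", x);      /* usually prints 2, but may print 1 */
    return 0;
}
```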

Deterministic

A parallel algorithm is deterministic if it always does the same thing on the same input, no matter how the instructions are scheduled on the multicore computer. It is nondeterministic if its behavior might vary from run to run when the input is the same. A parallel algorithm that is intended to be deterministic may nevertheless act nondeterministically if it contains a difficult-to-diagnose bug called a determinacy race.

Distributed memory

Multicore clusters usually have a distributed memory, where one multicore's memory cannot be accessed directly by a processor in another multicore. Instead, the processor must explicitly send a message over the cluster network to a processor in the remote multicore to request any data it requires.

Fork-join parallelism

Almost all task-parallel environments support fork-join parallelism, which is typically embodied in two linguistic features: spawning and parallel loops. Spawning allows a subroutine to be “forked”: executed like a subroutine call, except that the caller can continue to execute while the spawned subroutine computes its result. A parallel loop is like an ordinary for loop, except that multiple iterations of the loop can execute at the same time.

Fork-join parallel algorithms employ spawning and parallel loops to describe parallelism. A key aspect of this parallel model, inherited from the task-parallel model but different from the thread model, is that the programmer does not specify which tasks in a computation must run in parallel, only which tasks may run in parallel. The underlying runtime system uses threads to load-balance the tasks across the processors.
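
As a hedged illustration, here is the classic recursive Fibonacci routine expressed with fork-join keywords (the code is a sketch, not taken from the text).

```c
#include <cilk/cilk.h>

/* Fork-join Fibonacci: the spawned call may run in parallel with
   the rest of the parent, and cilk_sync joins them before the
   two results are combined. */
long fib(long n) {
    if (n < 2) return n;
    long a = cilk_spawn fib(n - 1);  /* fork: child may run in parallel */
    long b = fib(n - 2);             /* parent continues meanwhile */
    cilk_sync;                       /* join: wait for the child */
    return a + b;
}
```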

Greedy scheduler

A greedy scheduler assigns as many strands to processors as possible in each time step, never leaving a processor idle if there is work that can be done.

Ideal parallel computer

Our analyses generally assume that parallel algorithms execute on an ideal parallel computer, which consists of a set of processors and a sequentially consistent shared memory. The ideal parallel-computer model also assumes that each processor in the machine has equal computing power, and it ignores the cost of scheduling. This last assumption may sound optimistic, but it turns out that for algorithms with sufficient parallelism, the overhead of scheduling is generally minimal in practice.

Incomplete step

As a greedy scheduler manages a computation on $P$ processors, each step is classified as complete or incomplete. In an incomplete step, fewer than $P$ strands are ready to execute. A greedy scheduler assigns each ready strand to its own processor, leaving some processors idle for the step, but executing all the ready strands.

Invocation tree

A fork-join parallel trace can be pictured as a dag of strands embedded in an invocation tree of procedure instances. All directed edges connecting strands run either within a procedure or along undirected edges of the invocation tree. More general task-parallel traces that are not fork-join traces may contain some directed edges that do not run along the undirected tree edges.

Linear speedup

When the speedup $T_1/T_P$ of a computation on $P$ processors is linear in the number of processors, that is, when $T_1/T_P = \Theta(P)$, the computation exhibits linear speedup.

Load instructions

Memory is accessed by load instructions and by store instructions. Load instructions copy data from a location in memory to a register within a processor.

Logical parallelism

Parallel keywords like cilk_spawn, cilk_scope, cilk_sync, and cilk_for express the logical parallelism of a computation, indicating which parts of the computation may proceed in parallel (without requiring that they must do so). At runtime, it is up to a scheduler to determine which subcomputations actually run in parallel by assigning them to available processors as the computation unfolds.

Logically in parallel

If $u$ and $v$ are strands in a parallel trace $G$, and $G$ contains no directed path from $u$ to $v$ or from $v$ to $u$, then the strands are (logically) in parallel.

Logically in series

If $u$ and $v$ are strands in a parallel trace $G$, and $G$ contains a directed path from $u$ to $v$, then the strands are (logically) in series.

Multicores

Parallel computers—computers with multiple processing units—are ubiquitous. Handheld, laptop, desktop, and cloud machines are all multicore computers, or simply, multicores, containing multiple processing "cores." Each processing core is a full-fledged processor that can directly access any location in a common shared memory.

Nondeterministic

A parallel algorithm is nondeterministic if its behavior might vary from run to run when the input is the same. It is deterministic if it always does the same thing on the same input, no matter how the instructions are scheduled on the multicore computer. A parallel algorithm that is intended to be deterministic may nevertheless act nondeterministically if it contains a difficult-to-diagnose bug called a determinacy race.

Online

A task-parallel scheduler must operate online, scheduling the computation without knowing in advance when procedures will be spawned or when they will finish.

Parallel algorithms

Algorithms in which multiple instructions can execute simultaneously.

Parallel loops

A parallel loop is like an ordinary for loop, except that multiple iterations of the loop can execute at the same time. You can write parallel loops in OpenCilk with cilk_for.
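
A minimal sketch of a parallel loop (the routine and names are illustrative); it is race-free because each iteration updates a distinct element of y.

```c
#include <cilk/cilk.h>

/* y[i] += a * x[i] for all i; the iterations may run in parallel. */
void saxpy(long n, double a, const double *x, double *y) {
    cilk_for (long i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```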

Parallel slackness

We define the (parallel) slackness of a task-parallel computation executed on an ideal parallel computer with $P$ processors to be the ratio $(T_1/T_\infty)/P$, which is the factor by which the parallelism of the computation exceeds the number of processors in the machine. If the slackness is less than 1, perfect linear speedup is impossible.
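
As a small worked example (numbers chosen for illustration), a computation with work $T_1 = 10^8$ and span $T_\infty = 10^5$ has parallelism $T_1/T_\infty = 1000$; on $P = 10$ processors its slackness is $1000/10 = 100$, leaving ample room for near-perfect linear speedup.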

Parallel trace

It helps to view the execution of a parallel computation—the dynamic stream of runtime instructions executed by processors under the direction of a parallel program—as a directed acyclic graph $G = (V, E)$, called a (parallel) trace. Conceptually, the vertices in $V$ are executed instructions, and the edges in $E$ represent dependencies between instructions, where $(u, v) \in E$ means that the parallel program required instruction $u$ to execute before instruction $v$.

Parallelism

The ratio $T_1/T_\infty$ of the work to the span gives the parallelism of a parallel computation. We can view the parallelism from three perspectives. As a ratio, the parallelism denotes the average amount of work that can be performed in parallel for each step along the critical path. As an upper bound, the parallelism gives the maximum possible speedup that can be achieved on any number of processors. Perhaps most important, the parallelism provides a limit on the possibility of attaining perfect linear speedup. Specifically, once the number of processors exceeds the parallelism, the computation cannot possibly achieve perfect linear speedup.
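
For instance (numbers illustrative), a trace with work $T_1 = 600$ and span $T_\infty = 30$ has parallelism $600/30 = 20$: on average 20 units of work are available per step of the critical path, and no number of processors can produce a speedup greater than 20.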

Parent

A parent process is one that executes a spawn, after which it may continue to execute in parallel with the spawned subroutine—its child.

Perfect linear speedup

The maximum possible speedup of a computation on $P$ processors is

$T_1/T_P = P$,

which is called perfect linear speedup.

Ready

A strand is ready to execute when all strands on which it depends have finished execution.

Scheduler

The scheduler for task-parallel computations determines at runtime which subcomputations actually execute in parallel by assigning them to available processors as the computation unfolds.

Sequential consistency

Sequential consistency means that even if multiple processors attempt to access memory simultaneously, the shared memory behaves as if exactly one instruction from one of the processors is executed at a time, even though the actual transfer of data may happen at the same time. It is as if the instructions were executed one at a time sequentially according to some global linear order among all the processors that preserves the individual orders in which each processor executes its own instructions.

Serial algorithms

Serial algorithms are suitable for running on a uniprocessor computer that executes only one instruction at a time.

Serial projection

The serial projection of a parallel algorithm is the serial algorithm that results from ignoring the parallel directives, such as cilk_spawn, cilk_sync, and cilk_for.
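
A small sketch (illustrative routine, not from the text) of a parallel loop and the serial projection obtained by eliding its Cilk keyword:

```c
#include <cilk/cilk.h>

/* Parallel version: iterations may run in parallel. */
void scale(long n, double a, double *y) {
    cilk_for (long i = 0; i < n; ++i)
        y[i] *= a;
}

/* Serial projection: cilk_for is read as an ordinary for loop. */
void scale_serial(long n, double a, double *y) {
    for (long i = 0; i < n; ++i)
        y[i] *= a;
}
```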

Shared memory

In a multicore computer, shared memory can be directly accessed at any location by any of the processing cores.

Span

The span $T_\infty$ is the fastest possible time to execute the computation on an unlimited number of processors, which corresponds to the sum of the times taken by the strands along a longest path in the trace, where “longest” means that each strand is weighted by its execution time. Such a longest path is called the critical path of the trace, and thus the span is the weight of the longest (weighted) path in the trace.

Span law

The span provides a lower bound on the running time $T_P$ of a task-parallel computation on $P$ processors. A $P$-processor ideal parallel computer cannot run any faster than a machine with an unlimited number of processors. Looked at another way, a machine with an unlimited number of processors can emulate a $P$-processor machine by using just $P$ of its processors. Thus, the span law follows:

$T_P \ge T_\infty$.

Spawning

Spawning occurs when the keyword cilk_spawn precedes a procedure call. The semantics of a spawn differs from an ordinary procedure call in that the procedure instance that executes the spawn—the parent—may continue to execute in parallel with the spawned subroutine—its child—instead of waiting for the child to finish, as would happen in a serial execution.

Speedup

We define the speedup of a computation on $P$ processors by the ratio $T_1/T_P$, which says how many times faster the computation runs on $P$ processors than on one processor. By the work law, we have $T_P \ge T_1/P$, which implies that $T_1/T_P \le P$. Thus, the speedup on a $P$-processor ideal parallel computer can be at most $P$.

Store instructions

Memory is accessed by store instructions and by load instructions. Store instructions copy data from a processor register to a location in the memory.

Strand

It’s sometimes inconvenient, especially if we want to focus on the parallel structure of a computation, for a vertex of a trace to represent only one executed instruction. Consequently, if a chain of instructions contains no parallel or procedural control (no cilk_spawn, cilk_sync, procedure call, or return—via either an explicit return statement or the return that happens implicitly upon reaching the end of a procedure), we group the entire chain into a single strand. Strands do not include instructions that involve parallel or procedural control. These control dependencies must be represented as edges in the trace.

Task-parallel platforms, programming, and algorithms

Task-parallel platforms provide a layer of software on top of threads to coordinate, schedule, and manage the processors of a multicore. Some task-parallel platforms are built as runtime libraries, but others provide full-fledged parallel languages with compiler and runtime support.

Task-parallel programming allows parallelism to be specified in a “processor-oblivious” fashion, where the programmer identifies what computational tasks may run in parallel but does not indicate which thread or processor performs the task. Thus, the programmer is freed from worrying about communication protocols, load balancing, and other vagaries of thread programming. The task-parallel platform contains a scheduler, which automatically load-balances the tasks across the processors, thereby greatly simplifying the programmer’s chore. Task-parallel algorithms provide a natural extension to ordinary serial algorithms, allowing performance to be reasoned about mathematically using “work/span analysis.”

Threads and thread parallelism

One approach to programming multicores is thread parallelism. This processor-centric parallel-programming model employs a software abstraction of "virtual processors," or threads that share a common memory. Each thread maintains its own program counter and can execute code independently of the other threads. The operating system loads a thread onto a processing core for execution and switches it out when another thread needs to run.

Work

The work $T_1$ of a task-parallel computation is the total time to execute the entire computation on one processor. In other words, the work is the sum of the times taken by each of the strands. If each strand takes unit time, the work is just the number of vertices in the trace.

Work law

The work provides a lower bound on the running time $T_P$ of a task-parallel computation on $P$ processors: In one step, an ideal parallel computer with $P$ processors can do at most $P$ units of work, and thus in $T_P$ time, it can perform at most $P \cdot T_P$ work. Since the total work to do is $T_1$, we have $P \cdot T_P \ge T_1$. Dividing by $P$ yields the work law:

$T_P \ge T_1/P$.
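
As a quick numeric check (values illustrative): a computation with work $T_1 = 120$ run on $P = 4$ processors satisfies $T_P \ge 120/4 = 30$, so no scheduler can finish it in fewer than 30 time units.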