A good scheduler operates in a distributed fashion, where the threads implementing the scheduler cooperate to load-balance the computation. Provably good online, distributed schedulers exist, but analyzing them is complicated. To keep the analysis simple, one may consider an online centralized scheduler that knows the global state of the computation at any moment.
A child process is one that results from a spawn. It may run in parallel with the procedure that executed the spawn—its parent.
To reduce the overhead of recursive spawning, task-parallel platforms sometimes coarsen the leaves of the recursion by executing several iterations in a single leaf, either automatically or under programmer control. This optimization comes at the expense of reducing the parallelism. If the computation has sufficient parallel slackness, however, coarsening need not sacrifice near-perfect linear speedup.
As a greedy scheduler manages a computation on P processors, a complete step is one in which at least P strands are ready to execute; the scheduler executes any P of them.
See parallel trace.
With some parallel-programming models, a programmer cannot accomplish anything significant without dealing with concurrency, where the programmer herself must manage the interactions between otherwise independent subcomputations.
The critical path of a task-parallel trace is the longest path, weighted by execution time. The weight of the critical path is called the span of the trace.
A determinacy race occurs when two logically parallel instructions access the same memory location and at least one of the instructions modifies the value stored in the location. For a computation with a determinacy race, the results can vary depending on how the instructions are scheduled on the multicore computer. Often in practice, most instruction orderings produce correct results, but some orderings generate improper results when the instructions interleave. Consequently, races can be extremely hard to test for. Task-parallel programming environments often provide race-detection productivity tools to help you isolate race bugs.
A parallel algorithm is deterministic if it always does the same thing on the same input, no matter how the instructions are scheduled on the multicore computer. It is nondeterministic if its behavior might vary from run to run when the input is the same. A parallel algorithm that is intended to be deterministic may nevertheless act nondeterministically, however, if it contains a difficult-to-diagnose bug called a determinacy race.
Multicore clusters usually have a distributed memory, where one multicore's memory cannot be accessed directly by a processor in another multicore. Instead, the processor must explicitly send a message over the cluster network to a processor in the remote multicore to request any data it requires.
Almost all task-parallel environments support fork-join parallelism, which is typically embodied
in two linguistic features: spawning and parallel loops. Spawning allows a subroutine to be “forked”: executed like a subroutine call, except that the caller can continue to execute while the spawned subroutine computes its result. A parallel
loop is like an ordinary
for loop, except that multiple iterations of the loop can
execute at the same time.
Fork-join parallel algorithms employ spawning and parallel loops to describe parallelism. A key aspect of this parallel model, inherited from the task-parallel model but different from the thread model, is that the programmer does not specify which tasks in a computation must run in parallel, only which tasks may run in parallel. The underlying runtime system uses threads to load-balance the tasks across the processors.
A greedy scheduler assigns as many strands to processors as possible in each time step, never leaving a processor idle if there is work that can be done.
Ideal parallel computer
Our analyses generally assume that parallel algorithms execute on an ideal parallel computer, which consists of a set of processors and a sequentially consistent shared memory. The ideal parallel-computer model also assumes that each processor in the machine has equal computing power, and it ignores the cost of scheduling. This last assumption may sound optimistic, but it turns out that for algorithms with sufficient parallelism, the overhead of scheduling is generally minimal in practice.
As a greedy scheduler manages a computation on P processors, an incomplete step is one in which fewer than P strands are ready to execute; the scheduler executes all of them.
A fork-join parallel trace can be pictured as a dag of strands embedded in an invocation tree of procedure instances. All directed edges connecting strands run either within a procedure or along undirected edges of the invocation tree. More general task-parallel traces that are not fork-join traces may contain some directed edges that do not run along the undirected tree edges.
When the speedup of a computation on P processors grows linearly with the number of processors, that is, when T₁/T_P = Θ(P), the computation exhibits linear speedup.
Memory is accessed by load instructions and by store instructions. Load instructions copy data from a location in memory to a register within a processor.
Parallel keywords like cilk_spawn, cilk_sync, and cilk_for express
the logical parallelism of a computation, indicating which parts of the computation
may proceed in parallel (without requiring that they must do so). At runtime, it is up to a scheduler to determine
which subcomputations actually run in parallel by assigning them to available processors
as the computation unfolds.
Logically in parallel
Two strands of a trace are logically in parallel if neither precedes the other, that is, if no directed path in the trace runs from one to the other, and hence they may execute at the same time.
Logically in series
A strand of a trace is logically in series with another if a directed path in the trace runs from the first to the second, so that the first must finish before the second can begin.
Parallel computers—computers with multiple processing units—are ubiquitous. Handheld, laptop, desktop, and cloud machines are all multicore computers, or simply, multicores, containing multiple processing "cores." Each processing core is a full-fledged processor that can directly access any location in a common shared memory.
A parallel algorithm is nondeterministic if its behavior might vary from run to run when the input is the same. It is deterministic if it always does the same thing on the same input, no matter how the instructions are scheduled on the multicore computer. A parallel algorithm that is intended to be deterministic may nevertheless act nondeterministically, however, if it contains a difficult-to-diagnose bug called a determinacy race.
A task-parallel scheduler must operate online, scheduling the computation without knowing in advance when procedures will be spawned or when they will finish.
Algorithms in which multiple instructions can execute simultaneously.
A parallel loop is like an ordinary for loop, except that multiple iterations of the loop can execute at the same time.
You can write parallel loops in OpenCilk with the keyword cilk_for.
We define the (parallel) slackness of a task-parallel computation executed on an ideal parallel computer with P processors to be the ratio (T₁/T∞)/P of the parallelism of the computation to the number of processors.
It helps to view the execution of a parallel computation—the dynamic stream of runtime instructions executed by processors under the direction of a parallel program—as a directed acyclic graph, or dag, called a (parallel) trace.
A parent process is one that executes a spawn, after which it may continue to execute in parallel with the spawned subroutine—its child.
Perfect linear speedup
The maximum possible speedup of a computation on P processors is P, which is called perfect linear speedup.
A strand is ready to execute when all strands on which it depends have finished execution.
The scheduler for task-parallel computations determines at runtime which subcomputations actually execute in parallel by assigning them to available processors as the computation unfolds.
Sequential consistency means that even if multiple processors attempt to access memory simultaneously, the shared memory behaves as if exactly one instruction from one of the processors is executed at a time, even though the actual transfers of data may overlap in time. It is as if the instructions were executed one at a time sequentially according to some global linear order among all the processors that preserves the individual orders in which each processor executes its own instructions.
Serial algorithms are suitable for running on a uniprocessor computer that executes only one instruction at a time.
The serial projection of a parallel algorithm is the serial algorithm that results from ignoring the parallel directives: each spawn executes as an ordinary procedure call, each sync is elided, and each parallel loop executes as an ordinary for loop.
In a multicore computer, shared memory can be directly accessed at any location by any of the processing cores.
The span is the fastest possible time to execute the computation on an unlimited number of processors, which corresponds to the sum of the times taken by the strands along a longest path in the trace, where “longest” means that each strand is weighted by its execution time. Such a longest path is called the critical path of the trace, and thus the span is the weight of the longest (weighted) path in the trace.
The span provides a lower bound on the running time of a task-parallel computation on any number of processors: T_P ≥ T∞, since even unlimited processors cannot shorten the critical path. This bound is called the span law.
Spawning occurs when the keyword
cilk_spawn precedes a procedure call.
The semantics of a spawn differs from an ordinary procedure call in
that the procedure instance that executes the spawn—the parent—may continue
to execute in parallel with the spawned subroutine—its child—instead of waiting
for the child to finish, as would happen in a serial execution.
We define the speedup of a computation on P processors to be the ratio T₁/T_P of the running time on a single processor to the running time on P processors.
Memory is accessed by store instructions and by load instructions. Store instructions copy data from a processor register to a location in the memory.
It’s sometimes inconvenient, especially if we want to focus on the parallel structure
of a computation, for a vertex of a trace to represent only one executed instruction.
Consequently, if a chain of instructions contains no parallel or procedural control (no cilk_spawn, cilk_sync, procedure call, or return—via either an explicit return statement or the return that happens implicitly upon reaching the end of a procedure), we group the entire chain into a single strand.
Strands do not include instructions that involve parallel or procedural
control. These control dependencies must be represented as edges in the trace.
Task-parallel platforms, programming, and algorithms
Task-parallel platforms provide a layer of software on top of threads to coordinate, schedule, and manage the processors of a multicore. Some task-parallel platforms are built as runtime libraries, but others provide full-fledged parallel languages with compiler and runtime support.
Task-parallel programming allows parallelism to be specified in a “processor-oblivious” fashion, where the programmer identifies what computational tasks may run in parallel but does not indicate which thread or processor performs the task. Thus, the programmer is freed from worrying about communication protocols, load balancing, and other vagaries of thread programming. The task-parallel platform contains a scheduler, which automatically load-balances the tasks across the processors, thereby greatly simplifying the programmer’s chore. Task-parallel algorithms provide a natural extension to ordinary serial algorithms, allowing performance to be reasoned about mathematically using “work/span analysis.”
Threads and thread parallelism
One approach to programming multicores is thread parallelism. This processor-centric parallel-programming model employs a software abstraction of "virtual processors," or threads that share a common memory. Each thread maintains its own program counter and can execute code independently of the other threads. The operating system loads a thread onto a processing core for execution and switches it out when another thread needs to run.
The work of a task-parallel computation is the total time to execute the entire computation on one processor. In other words, the work is the sum of the times taken by each of the strands. If each strand takes unit time, the work is just the number of vertices in the trace.
The work provides a lower bound on the running time of a task-parallel computation on P processors: T_P ≥ T₁/P, because in each time step the P processors can execute at most P units of work. This bound is called the work law.