# Prefix sum

In computer science, the prefix sum, cumulative sum, inclusive scan, or simply scan of a sequence of numbers x0, x1, x2, ... is a second sequence of numbers y0, y1, y2, ..., the sums of prefixes (running totals) of the input sequence:

• y0 = x0
• y1 = x0 + x1
• y2 = x0 + x1 + x2
• ...

For instance, the prefix sums of the natural numbers are the triangular numbers:

    input numbers  1  2  3  4  5  6 ...
    prefix sums    1  3  6 10 15 21 ...

Prefix sums are trivial to compute in sequential models of computation, by using the formula yi = yi−1 + xi to compute each output value in sequence order.
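In Python, for example, the sequential recurrence can be written as a single one-pass loop (a minimal sketch):

```python
def prefix_sums(xs):
    """Inclusive prefix sums: ys[i] = xs[0] + ... + xs[i]."""
    ys = []
    running = 0
    for x in xs:
        running += x  # y_i = y_{i-1} + x_i
        ys.append(running)
    return ys
```

The standard library's `itertools.accumulate` computes the same sequence.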

However, despite their ease of computation, prefix sums are a useful primitive in certain algorithms such as counting sort, and they form the basis of the scan higher-order function in functional programming languages.

Prefix sums have also been much studied in parallel algorithms, both as a test problem to be solved and as a useful primitive to be used as a subroutine in other parallel algorithms.

Abstractly, a prefix sum requires only a binary associative operator ⊕, making it useful for many applications, from calculating well-separated pair decompositions of points to string processing.

Mathematically, the operation of taking prefix sums can be generalized from finite to infinite sequences; in that context, a prefix sum is known as a partial sum of a series.

Prefix summation (partial summation) forms a linear operator on the vector space of finite or infinite sequences; its inverse is the finite difference operator.
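As a quick illustration of this inverse relationship, differencing a sequence of running totals recovers the original sequence (a Python sketch):

```python
from itertools import accumulate

def finite_difference(ys):
    """Inverse of prefix summation: x_i = y_i - y_{i-1}, with y_{-1} = 0."""
    return [y - y_prev for y_prev, y in zip([0] + ys, ys)]

xs = [3, 1, 4, 1, 5, 9]
ys = list(accumulate(xs))           # prefix summation
assert finite_difference(ys) == xs  # finite differencing inverts it
```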

## Scan higher-order function

In functional programming terms, the prefix sum may be generalized to any binary operation (not just the addition operation); the higher-order function resulting from this generalization is called a scan, and it is closely related to the fold operation.

Both the scan and the fold operations apply the given binary operation to the same sequence of values, but differ in that the scan returns the whole sequence of results from the binary operation, whereas the fold returns only the final result.

For instance, the sequence of factorial numbers may be generated by a scan of the natural numbers using multiplication instead of addition:

    input numbers    1  2  3   4    5    6 ...
    prefix products  1  2  6  24  120  720 ...
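In Python, for instance, `functools.reduce` plays the role of a fold and `itertools.accumulate` that of a scan; the difference is visible on the factorial example:

```python
from functools import reduce
from itertools import accumulate
from operator import mul

nums = [1, 2, 3, 4, 5, 6]
fold_result = reduce(mul, nums)            # only the final product
scan_result = list(accumulate(nums, mul))  # every intermediate product

print(fold_result)  # 720
print(scan_result)  # [1, 2, 6, 24, 120, 720]
```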

### Inclusive and exclusive scans

An inclusive scan includes input xi when computing output yi, whereas an exclusive scan does not, so its first output is the identity element (or a supplied initial value). The following table lists examples of the inclusive and exclusive scan functions provided by a few programming languages and libraries:

| Language/library | Inclusive scan | Exclusive scan |
| --- | --- | --- |
| Haskell | `scanl1` | `scanl` |
| MPI | `MPI_Scan` | `MPI_Exscan` |
| C++ | `std::inclusive_scan` | `std::exclusive_scan` |
| Scala | `scan` | |
| Rust | | |
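The distinction can be sketched in Python: an inclusive scan includes element i in output i, while an exclusive scan shifts the results by one and starts from the identity:

```python
from itertools import accumulate

xs = [1, 2, 3, 4]
inclusive = list(accumulate(xs))                  # [1, 3, 6, 10]
exclusive = list(accumulate(xs, initial=0))[:-1]  # [0, 1, 3, 6]
```

(The `initial` keyword requires Python 3.8 or later.)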

## Parallel algorithms

There are two key algorithms for computing a prefix sum in parallel. The first offers a shorter span and more parallelism but is not work-efficient; the second is work-efficient but requires double the span and offers less parallelism. These are presented in turn below.

### Algorithm 1: Shorter span, more parallel

Hillis and Steele present a data-parallel prefix sum algorithm that proceeds in rounds: in the round with offset d (for d = 1, 2, 4, ..., doubling each round), every element at index i ≥ d adds in the value at index i − d.

Given n processors to perform each iteration of the inner loop in constant time, the algorithm as a whole runs in O(log n) time, the number of iterations of the outer loop.
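A sequential simulation of this scheme in Python (a sketch; a real implementation would perform each round's additions in parallel):

```python
def hillis_steele_scan(xs):
    """Prefix sum in ceil(log2 n) rounds of offset doubling."""
    ys = list(xs)
    offset = 1
    while offset < len(ys):
        snapshot = list(ys)  # all reads in a round see the previous round
        for i in range(offset, len(ys)):
            ys[i] = snapshot[i] + snapshot[i - offset]
        offset *= 2  # double the offset each round
    return ys
```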

### Algorithm 2: Work-efficient

A work-efficient parallel prefix sum can be computed by the following steps.

1. Compute the sums of consecutive pairs of items in which the first item of the pair has an even index: z0 = x0 + x1, z1 = x2 + x3, etc.
2. Recursively compute the prefix sum w0, w1, w2, ... of the sequence z0, z1, z2, ...
3. Express each term of the final sequence y0, y1, y2, ... as the sum of up to two terms of these intermediate sequences: y0 = x0, y1 = z0, y2 = z0 + x2, y3 = w1, etc. After the first value, each successive number yi is either copied from a position half as far through the w sequence, or is the previous value added to one value in the x sequence.
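The three steps above can be sketched recursively in Python (a sequential stand-in; steps 1 and 3 would run in parallel across indices):

```python
def work_efficient_scan(xs):
    """Recursive work-efficient prefix sum (steps 1-3 above)."""
    if len(xs) <= 1:
        return list(xs)
    # Step 1: pairwise sums z_i = x_{2i} + x_{2i+1}
    zs = [xs[i] + xs[i + 1] for i in range(0, len(xs) - 1, 2)]
    # Step 2: recurse on the half-length sequence
    ws = work_efficient_scan(zs)
    # Step 3: y_0 = x_0; y_{2i+1} = w_i; y_{2i} = w_{i-1} + x_{2i}
    ys = []
    for i, x in enumerate(xs):
        if i == 0:
            ys.append(x)
        elif i % 2 == 1:
            ys.append(ws[i // 2])
        else:
            ys.append(ws[i // 2 - 1] + x)
    return ys
```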

If the input sequence has n values, the recursion continues to a depth of O(log n), which is also the bound on the parallel running time of this algorithm.

The total number of steps of the algorithm is O(n), and it can be implemented on a parallel random access machine with O(n/log n) processors without any asymptotic slowdown by assigning multiple indices to each processor in rounds of the algorithm for which there are more elements than processors.

### Discussion

Each of the preceding algorithms runs in O(log n) time. However, the former takes exactly log2 n steps, while the latter requires 2 log2 n − 2 steps. For the 16-input examples illustrated, Algorithm 1 is 12-way parallel (49 units of work divided by a span of 4) while Algorithm 2 is only 4-way parallel (26 units of work divided by a span of 6).

However, Algorithm 2 is work-efficient: it performs only a constant factor (2) of the amount of work required by the sequential algorithm. Algorithm 1 is work-inefficient: it performs asymptotically more work (a logarithmic factor) than is required sequentially. Consequently, Algorithm 1 is likely to perform better when abundant parallelism is available, but Algorithm 2 is likely to perform better when parallelism is more limited.

Parallel algorithms for prefix sums can often be generalized to other scan operations on associative binary operations, and they can be computed efficiently on modern parallel hardware such as a GPU. The idea of building in hardware a functional unit dedicated to computing multi-parameter prefix sums was patented by Uzi Vishkin.

Many parallel implementations follow a two-pass procedure: partial prefix sums are calculated in the first pass on each processing unit; the prefix sum of these partial sums is then calculated and broadcast back to the processing units for a second pass, which uses the now-known prefix as its initial value. Asymptotically this method takes approximately two read operations and one write operation per item.
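A sequential sketch of this two-pass procedure, with the processing units simulated by blocks (the block count here is an arbitrary choice for illustration):

```python
from itertools import accumulate

def two_pass_prefix_sum(xs, num_blocks=4):
    """Two-pass scheme: per-block partial sums, then a scan of those
    sums broadcast back as starting offsets for a second local pass."""
    n = len(xs)
    bounds = [n * b // num_blocks for b in range(num_blocks + 1)]
    blocks = [xs[bounds[b]:bounds[b + 1]] for b in range(num_blocks)]
    # Pass 1 (conceptually parallel across units): local totals.
    block_sums = [sum(block) for block in blocks]
    # Exclusive scan of the block totals gives each unit's offset.
    offsets = [0]
    for s in block_sums[:-1]:
        offsets.append(offsets[-1] + s)
    # Pass 2 (conceptually parallel): local scan seeded with the offset.
    out = []
    for offset, block in zip(offsets, blocks):
        out.extend(list(accumulate(block, initial=offset))[1:])
    return out
```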

### Concrete implementations of prefix sum algorithms

An implementation of a parallel prefix sum algorithm, like other parallel algorithms, has to take the parallelization architecture of the platform into account. More specifically, multiple algorithms exist which are adapted for platforms working on shared memory as well as algorithms which are well suited for platforms using distributed memory, relying on message passing as the only form of interprocess communication.

#### Shared memory: Two-level algorithm

The following algorithm assumes a shared memory machine model; all processing elements (PEs) have access to the same memory. A version of this algorithm is implemented in the Multi-Core Standard Template Library (MCSTL), a parallel implementation of the C++ standard template library which provides adapted versions for parallel computing of various algorithms.

In a first sweep, each PE calculates a local prefix sum for its block. The last block need not be calculated, since these prefix sums serve only as offsets to the prefix sums of succeeding blocks, and no block succeeds the last.

In a second sweep, the first block does not have to be processed, since it does not need to account for the offset of a preceding block. The last block, however, is now included, and the prefix sums for each block are calculated taking into account the block offsets computed in the previous sweep.
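The two sweeps might be sketched as follows (a sequential stand-in for the PEs; the PE count and block partition are assumptions for illustration, and the sketch assumes at least two PEs and n ≥ num_pes):

```python
def two_level_scan(xs, num_pes=4):
    """Two-sweep shared-memory scheme: local prefix sums per block,
    then offsets from an exclusive scan of the block totals."""
    n = len(xs)
    bounds = [n * p // num_pes for p in range(num_pes + 1)]
    ys = list(xs)
    # Sweep 1: every PE but the last computes its block's local
    # prefix sums in place.
    for p in range(num_pes - 1):
        running = 0
        for i in range(bounds[p], bounds[p + 1]):
            running += xs[i]
            ys[i] = running
    # Exclusive scan of each block's total gives each block's offset.
    offsets = [0]
    for p in range(num_pes - 1):
        offsets.append(offsets[-1] + ys[bounds[p + 1] - 1])
    # Sweep 2: every PE but the first adds its offset; the last block
    # (skipped in sweep 1) is scanned directly from its offset.
    for p in range(1, num_pes):
        if p < num_pes - 1:
            for i in range(bounds[p], bounds[p + 1]):
                ys[i] += offsets[p]
        else:
            running = offsets[p]
            for i in range(bounds[p], bounds[p + 1]):
                running += xs[i]
                ys[i] = running
    return ys
```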

#### Large message sizes: pipelined binary tree

The Pipelined Binary Tree Algorithm is another algorithm for distributed memory platforms that is specifically well suited for large message sizes. Like the hypercube algorithm, it assumes a special communication structure: the processing elements (PEs) are hypothetically arranged in a binary tree (e.g. a Fibonacci tree) with infix numeration according to their index within the PEs. Communication on such a tree always occurs between parent and child nodes.

Note the distinction between subtree-local and total prefix sums. These dependencies may at first appear circular, but they are not: lower-level PEs might require the total prefix sum of higher-level PEs to calculate their total prefix sum, but higher-level PEs only require subtree-local prefix sums to calculate theirs. The root node, as the highest-level node, only requires the local prefix sum of its left subtree to calculate its own prefix sum. Each PE on the path from PE0 to the root PE only requires the local prefix sum of its left subtree to calculate its own prefix sum, whereas every node on the path from PEp−1 (the last PE) to the root requires the total prefix sum of its parent to calculate its own total prefix sum.

This leads to a two-phase algorithm: an upward phase that propagates subtree-local prefix sums toward the root, and a downward phase that propagates total prefix sums back toward the leaves. Note that the algorithm is run in parallel at each PE, and the PEs will block upon receive until their children/parents provide them with packets.

##### Pipelining

The algorithm can further be optimised by making use of full-duplex or telephone model communication and overlapping the upward and the downward phase.

## Data structures

When a data set may be updated dynamically, it may be stored in a Fenwick tree data structure. This structure allows both the lookup of any individual prefix sum value and the modification of any array value in logarithmic time per operation.

However, an earlier 1982 paper presents a data structure called the Partial Sums Tree (see Section 5.1) that appears to overlap with Fenwick trees; in 1982 the term prefix-sum was not yet as common as it is today.
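A minimal Fenwick tree sketch in Python showing both logarithmic-time operations:

```python
class FenwickTree:
    """Binary indexed tree: point update and prefix-sum query,
    each in O(log n)."""

    def __init__(self, n):
        self.n = n
        self.tree = [0] * (n + 1)  # 1-based internal indexing

    def add(self, i, delta):
        """Add delta to element i (0-based)."""
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i  # jump to the next responsible node

    def prefix_sum(self, i):
        """Sum of elements 0..i inclusive."""
        i += 1
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i  # strip the lowest set bit
        return total
```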

For higher-dimensional arrays, the summed area table provides a data structure based on prefix sums for computing sums of arbitrary rectangular subarrays. This can be a helpful primitive in image convolution operations.
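A small Python sketch of a summed area table over a 2D list, with constant-time rectangle queries by inclusion-exclusion:

```python
def summed_area_table(grid):
    """S[i][j] = sum of grid[0..i][0..j], via the recurrence
    S[i][j] = g[i][j] + S[i-1][j] + S[i][j-1] - S[i-1][j-1]."""
    h, w = len(grid), len(grid[0])
    S = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            S[i][j] = (grid[i][j]
                       + (S[i - 1][j] if i else 0)
                       + (S[i][j - 1] if j else 0)
                       - (S[i - 1][j - 1] if i and j else 0))
    return S

def rect_sum(S, r0, c0, r1, c1):
    """Sum over the rectangle [r0..r1] x [c0..c1] in O(1)."""
    total = S[r1][c1]
    if r0: total -= S[r0 - 1][c1]
    if c0: total -= S[r1][c0 - 1]
    if r0 and c0: total += S[r0 - 1][c0 - 1]
    return total
```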

## Applications

Counting sort is an integer sorting algorithm that uses the prefix sum of a histogram of key frequencies to calculate the position of each key in the sorted output array. It runs in linear time for integer keys that are smaller than the number of items, and is frequently used as part of radix sort, a fast algorithm for sorting integers that are less restricted in magnitude.
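A Python sketch of counting sort that makes the role of the prefix sum explicit:

```python
def counting_sort(keys, k):
    """Stable counting sort of integer keys in range(k), using the
    exclusive prefix sum of the histogram as starting positions."""
    hist = [0] * k
    for key in keys:
        hist[key] += 1
    # Exclusive prefix sum: pos[v] = number of keys smaller than v,
    # i.e. the index where the first occurrence of v belongs.
    pos = [0] * k
    for v in range(1, k):
        pos[v] = pos[v - 1] + hist[v - 1]
    out = [None] * len(keys)
    for key in keys:
        out[pos[key]] = key
        pos[key] += 1
    return out
```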

List ranking, the problem of transforming a linked list into an array that represents the same sequence of items, can be viewed as computing a prefix sum on the sequence 1, 1, 1, ... and then mapping each item to the array position given by its prefix sum value. By combining list ranking, prefix sums, and Euler tours, many important problems on trees may be solved by efficient parallel algorithms.
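A sequential Python sketch of the correspondence (here the linked list is an array of successor indices, an assumption for illustration; parallel list ranking would use techniques such as pointer jumping):

```python
def list_rank(next_index, head):
    """Place each linked-list node at the array position given by the
    running count (prefix sum of 1, 1, 1, ...) of nodes seen before it."""
    order = [None] * len(next_index)
    rank = 0  # prefix sum of the all-ones sequence so far
    node = head
    while node is not None:
        order[rank] = node
        rank += 1
        node = next_index[node]
    return order
```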

An early application of parallel prefix sum algorithms was in the design of binary adders, Boolean circuits that can add two n-bit binary numbers. In this application, the sequence of carry bits of the addition can be represented as a scan operation on the sequence of pairs of input bits, using the majority function to combine the previous carry with these two bits. Each bit of the output number can then be found as the exclusive or of two input bits with the corresponding carry bit.

By using a circuit that performs the operations of the parallel prefix sum algorithm, it is possible to design an adder that uses O(n) logic gates and O(log n) time steps.
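A bit-level Python sketch of the carry computation described above (least-significant bit first; this rippling version evaluates the carries sequentially, whereas a hardware realization would evaluate the same scan in O(log n) depth):

```python
def add_binary(a_bits, b_bits):
    """Add two equal-length bit lists (LSB first): each carry is
    majority(a, b, carry_in), each output bit a XOR b XOR carry_in."""
    def majority(x, y, z):
        return (x & y) | (x & z) | (y & z)

    out, carry = [], 0
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)
        carry = majority(a, b, carry)
    out.append(carry)  # final carry-out
    return out
```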

In the parallel random access machine model of computing, prefix sums can be used to simulate parallel algorithms that assume the ability for multiple processors to access the same memory cell at the same time, on parallel machines that forbid simultaneous access. By means of a sorting network, a set of parallel memory access requests can be ordered into a sequence such that accesses to the same cell are contiguous within the sequence; scan operations can then be used to determine which of the accesses succeed in writing to their requested cells, and to distribute the results of memory read operations to multiple processors that request the same result.

In Guy Blelloch's Ph.D. thesis, parallel prefix operations form part of the formalization of the data parallelism model provided by machines such as the Connection Machine. The Connection Machine CM-1 and CM-2 provided a hypercubic network on which Algorithm 1 above could be implemented, whereas the CM-5 provided a dedicated network to implement Algorithm 2.

In the construction of Gray codes, sequences of binary values with the property that consecutive sequence values differ from each other in a single bit position, a number n can be converted into the Gray code value at position n of the sequence simply by taking the exclusive or of n and n/2 (the number formed by shifting n right by a single bit position).

The reverse operation, decoding a Gray-coded value x into a binary number, is more complicated, but can be expressed as the prefix sum of the bits of x, where each summation operation within the prefix sum is performed modulo two. A prefix sum of this type may be performed efficiently using the bitwise Boolean operations available on modern computers, by computing the exclusive or of x with each of the numbers formed by shifting x to the right by a number of bits that is a power of two.
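Both directions in Python, for values up to 32 bits (a sketch; the shift direction follows the machine-integer convention used for n ⊕ n/2 above):

```python
def gray_encode(n):
    """Gray code at position n: n XOR (n shifted right by one)."""
    return n ^ (n >> 1)

def gray_decode(x):
    """Prefix sum mod 2 of the bits of x, via XORs with shifts of x
    by powers of two (the doubling trick)."""
    for shift in (1, 2, 4, 8, 16):  # covers 32-bit values
        x ^= x >> shift
    return x
```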

Parallel prefix (using multiplication as the underlying associative operation) can also be used to build fast algorithms for parallel polynomial interpolation. In particular, it can be used to compute the divided difference coefficients of the Newton form of the interpolation polynomial. This prefix-based approach can also be used to obtain the generalized divided differences for (confluent) Hermite interpolation as well as for parallel algorithms for Vandermonde systems.