## **Optimal Prefix Circuits with Fan-out 2** Yen-Chun Lin Dept. of Electronic Engineering, National Taiwan Institute of Technology P.O. Box 90-100, Taipei 106, Taiwan ### Abstract Given n values $v_1, v_2, ..., v_n$ and an associative binary operation, denoted by o, the prefix problem is to compute the n prefixes $v_1 \circ v_2 \circ ... \circ v_i$ , $1 \le i \le n$ . Because of the importance of prefix computation, many combinational circuits for solving the prefix problem, called prefix circuits, have been designed. It has been proved that the size s(n) and the depth d(n) of an n-input prefix circuit G(n) satisfy the inequality $d(n) + s(n) \ge$ 2n-2; thus, a prefix circuit is optimal if d(n) + s(n) =2n-2. For the first time in this paper, we present a systematic method to construct optimal parallel prefix circuits with fan-out 2. We also solve an open problem by building a class of optimal prefix circuits with fan-out 2 whose depth can be any integer between n-1 and $2 \lfloor \lg n \rfloor - 1$ , or $2 \lfloor \lg n \rfloor$ depending on the value of The optimal prefix circuits have corresponding optimal prefix algorithms running on a fully connected message-passing multicomputer. ### 1. Introduction Given n values $v_1, v_2, ..., v_n$ and an associative binary operation, denoted by o, the prefix computation problem, or simply the prefix problem, is to compute the n prefixes $v_1$ o $v_2$ o ... o $v_i$ , $1 \le i \le n$ . Prefix computation has been extensively studied for its broad applications [2, 6, 11, 13, 16]; for example, it is used in loop parallelization, the evaluation of polynomials, the solution of linear recurrences, and list ranking. Because of the importance of prefix computation, many combinational circuits for solving the prefix problem, called prefix circuits, have been designed and studied [3, 5, 8, 12-14, 18]. An n-input prefix circuit can be regarded as a directed acyclic graph G(n) containing operation nodes and duplication nodes; unless otherwise stated, we assume n is the number of inputs and n need not be a power of two. As shown in Fig. 1, the operation node takes two inputs to perform the operation o. The duplication node, also depicted in Fig. 1, takes an input to produce multiple, actually only two in this paper, copies of the input as output. As shown in the figure, an operation node is represented by a black dot, while a duplication node is denoted by a small circle. The size s(n) of G(n) is defined to be the number of operation nodes in G(n). The depth d(n) of G(n) is the maximum number of operation nodes on any directed path. It has been proved that for all G(n), d(n) + $s(n) \ge 2n - 2$ [18]. Thus, G(n) is depth-size optimal, or simply optimal, if d(n) + s(n) = 2n - 2. A node has unbounded fan-out if the fan-out is not fixed and is a function of n. The fan-out of a circuit is the maximum of the fan-out of all nodes in the circuit; thus, a circuit has fan-out 2 if every node has at most fan-out 2, and a circuit has unbounded fan-out if one of its nodes has unbounded fan-out. Because a node having more fan-out is slower and bigger [21], for a circuit to be of practical use, the fan-out should be bounded and as small as possible. Fig. 1. Operation node and duplication node. No systematic approach has been known for designing optimal parallel prefix circuits with bounded fan-out. Ladner and Fischer [12] introduce the depth-size trade-off in parallel prefix circuits; that is, the size can be decreased by increasing the depth. They also design non-optimal prefix circuits of minimum depth $d(n) = \lceil \lg n \rceil$ and s(n) < 4n; the circuits have unbounded fan-out and thus are not practical. Brent and Kung [3] present a non-optimal parallel prefix circuit with fan-out 2, for n equal to a power of two; it has $d(n) = 2 \lg n - 1$ . Fich [8] derives lower and upper bounds for the size of parallel prefix circuits when n is a power of two. Snir [18] not only proves that $d(n) + s(n) \ge 2n - 2$ holds for all prefix circuits but also constructs a class of optimal prefix circuits with unbounded fan-out for any d(n) in the range max( $\lceil \lg n \rceil$ , $2 \lceil \lg n \rceil - 2$ ) $\le d(n) \le n - 1$ . Lakshmivarahan, Yang, and Dhall [14] design a class of optimal parallel prefix circuits with unbounded fan-out and $\lceil \lg n \rceil \le d(n) \le \max(\lceil \lg n \rceil, 2 \lceil \lg n \rceil - 3)$ . For the first time in this paper, we present a systematic, recursive method to construct optimal parallel prefix circuits with fan-out 2. We also solve an open problem by building a class of optimal prefix circuits with fan-out 2 whose depth can be any integer between n-1 and $2 \lfloor \lg n \rfloor -1$ , or $2 \lfloor \lg n \rfloor$ depending on the value of n The optimal prefix circuits with fan-out 2 can be mapped to optimal prefix algorithms running on a fully connected message-passing multicomputer of which each PE can only send or receive a message to or from any other PE in a communication step. The prefix algorithms are optimal because the sum of the number of communication steps and the number of messages equals 2n-2. This new computer model will be called the send-or-receive model. The send-or-receive model of communication is very important [10]. Because it is the weakest communication model on a fully connected multicomputer, upper bounds of this model apply to other communication models, and lower bounds of this model are upper bounds for lower bounds of other models. Algorithms developed for this model are suited to implementation on leading-edge message-passing multicomputers, such as the IBM Scalable POWERparallel 2 (SP2) [1], nCUNE 2 [15], and CM-5 [20]. Although a PE of a modern message-passing parallel computer can send a message to a directly connected PE and, in the same step, receive a message from another directly connected PE, it usually takes longer to send and receive in a step than to send only or receive only due to the inherent hardware capability and software overhead [9]. Similarly, while the SP2 can be programmed as fully connected computers, the multistage interconnection may make it take longer to pass more messages in a communication step [19]. On an n-PE system, the send-or-receive model insures that no more than $\lfloor n/2 \rfloor$ messages are communicated in a communication step and thus a communication step will not take too much time. The fully connected model has some advantages in algorithm development [4]. First, it adds portability to algorithms on computers that can dynamically allocate PEs and that can tolerate PE faults. In addition, designing algorithms for the fully connected model need not consider the tedious details of routing messages, insights may be gained for developing algorithms on computers with a particular interconnection among PEs. Further, if the communication requirement of an algorithm for the fully connected model can be efficiently mapped to another target model, that algorithm can be transformed into a new one for the desired model. Many parallel prefix algorithms have also been presented for running on various parallel computer models. Of all parallel computer models, the concurrent-read concurrent-write (CRCW) parallel random-access machine (PRAM) model is the strongest. When the initial values $v_i$ , $1 \le i \le n$ , are O(log n)-bit numbers, Cole and Vishkin solve the prefix sums problem, in which the associative binary operation is the arithmetic addition, on a CRCW PRAM in O(log n/log log n) time using n log log n/log n PEs [6]. Note that the prefix sums problem is a special case of the prefix problem we solve; the two problems are not quite the same. On an exclusive-read exclusive-write (EREW) PRAM, as well as on a CRCW PRAM, Ladner and Fischer's algorithm can solve the prefix problem in $\Theta(\log n)$ time using $\Theta(n/\log n)$ PEs [12]; the time complexity serves as a lower bound for all models. Furthermore, on an EREW PRAM with p < n PEs, prefix computation takes $\Theta(n/p + \log p)$ time [17], which is time-optimal and cost-optimal when $p = \Theta(n/\log n)$ . On an *n*-PE send-or-receive model, Lin and Lin design a family of prefix algorithms that take $\Theta(\log n)$ time, thus are time-optimal [22-23]. With only p < n PEs, Lin and Lin also present a prefix algorithm that takes $\Theta(n/p + \log p)$ time and is time-optimal and cost-optimal when $p = \Theta(n/\log n)$ [22]. Section 2 presents the recursive method to construct optimal parallel prefix circuits with fan-out 2. Section 3 solves an open problem by constructing optimal prefix circuits with fan-out 2 and greater depths. Section 4 shows how an optimal prefix algorithm running on the send-or-receive model can be derived directly from an optimal prefix circuit with fan-out 2. Section 5 concludes this paper. ### 2. Optimal prefix circuits with fan-out 2 This section constructs a class of optimal prefix circuits with fan-out 2, and we will use L(n) to denote an *n*-input circuit in this class. L(2) and L(3) are shown in Fig. 2, which are in fact serial circuits. For ease of presentation, i:j is used to represent the result of computing $v_i$ o $v_{i+1}$ o ... o $v_i$ , where $i \le j$ . Note that as shown in Fig. 2, 1:2, for example, on the right-hand side of an operation node denotes that the node produces 1:2. Fig. 2. Prefix circuits L(2) and L(3). Given L(2) and L(3), the prefix circuit L(n), where n > 3, can be constructed recursively as depicted in Figs. 3 and 4 for n odd and n even, respectively. Specifically, when n is odd and n > 3, the first level of L(n) contains $\lfloor n/2 \rfloor$ operation nodes; the ith node from the left computes (2i-1):2i. The outputs of these $\lfloor n/2 \rfloor$ operation nodes are fed to the circuit $L(\lfloor n/2 \rfloor)$ to produce 1:4, 1:6, ..., 1:(n-1). At the last level of L(n), $\lfloor n/2 \rfloor$ operation nodes are used to generate 1:3, 1:5, ..., 1:n. When n is even and n > 3, the first level of L(n) contains n/2 - 1 operation nodes; the ith node from the left also computes (2i - 1):2i. The outputs of these n/2 - 1 operation nodes and the (n-1)th input of L(n) are fed to the circuit L(n/2) to produce 1:4, 1:6, ..., 1:(n-4), 1:(n-2), and 1:(n-1). At the last level of L(n), n/2 - 1 operation nodes are used to generate 1:3, 1:5, ..., 1:(n-5), 1:(n-3), and 1:n. To illustrate the construction, L(4), L(5), and L(6) are shown in Fig. 5. Note that L(4), like L(2) and L(3), is also a serial circuit; however, for n > 4, L(n) is no more serial. Fig. 3. Prefix circuit L(n), where n is odd and n > 3. Fig. 4. Prefix circuit L(n), where n is even and n > 3. Fig. 5. Prefix circuits L(4), L(5), and L(6). **Theorem 1.** L(n) is optimal; that is, d(n) + s(n) = 2n - 2. **Proof:** We shall prove by induction on n. From Fig. 2, d(2) = s(2) = 1, and d(3) = s(3) = 2. So, the claim holds for n = 2 and n = 3. Assume that the claim holds for $n \le k$ , where $k \ge 3$ , we show that the claim also holds for n = k + 1. From Figs. 3 and 4, we have $$d(n) = d(\lfloor n/2 \rfloor) + 2, \qquad \text{if } n \ge 4; \tag{1}$$ $$s(n) = s(\lfloor n/2 \rfloor) + n - 1,$$ if $n \ge 5$ and $n$ odd; (2) $$s(n) = s(n/2) + n - 2,$$ if $n \ge 4$ and $n$ even. (3) We have two cases to consider. Case (i) $n = k + 1 \ge 5$ and n is odd. From Eqs. (1) and (2), $$d(n) + s(n) = d(k+1) + s(k+1)$$ $$= d(\lfloor (k+1)/2 \rfloor) + 2$$ $$+ s(\lfloor (k+1)/2 \rfloor) + (k+1) - 1$$ $$= d(\lfloor (k+1)/2 \rfloor) + s(\lfloor (k+1)/2 \rfloor)$$ $$+ k + 2$$ $$= 2(\lfloor (k+1)/2 \rfloor) - 2 + k + 2$$ $$= k - 2 + k + 2$$ $$= 2n - 2.$$ Case (ii) $n = k + 1 \ge 4$ and n is even. From Eqs. (1) and (3), $$d(n) + s(n) = d(k+1) + s(k+1)$$ $$= d((k+1)/2) + 2 + s((k+1)/2)$$ $$+ (k+1) - 2$$ $$= d((k+1)/2) + s((k+1)/2)$$ $$+ k+1$$ $$= 2((k+1)/2) - 2 + k + 1$$ $$= 2n - 2.$$ Q.E.D. **Theorem 2.** For L(n), $d(n) = 2 \lfloor \lg n \rfloor - 1, \quad \text{if } 2^i \le n < 3 \times 2^{i-1} \text{ and } i \ge 1;$ $d(n) = 2 \lfloor \lg n \rfloor, \quad \text{if } 3 \times 2^{i-1} \le n < 2^{i+1} \text{ and } i \ge 1.$ **Proof:** We will use the following equation [7]: for any integer n and integer $k \neq 0$ , $$\lfloor \lfloor n/k \rfloor / 2 \rfloor = \lfloor n/2k \rfloor$$ . In addition, we will also use d(2) = 1 and d(3) = 2, which follow from Fig. 2. Clearly, the claim holds for n = 2 and 3. Let $2^i \le n < 2^{i+1}$ , where $i \ge 2$ . From Figs. 3 and 4, $$d(n) = d(\lfloor n/2 \rfloor) + 2$$ $$= d(\lfloor n/4 \rfloor) + 4$$ $$= d(\lfloor n/2 \rfloor + 2(i-1))$$ Thus, when $2^{i} \le n < 3 \times 2^{i-1}$ and $i \ge 2$ , $$d(n) = d(2) + 2(i - 1)$$ = 2i - 1 = 2 \left[ \left] g \ n \right] - 1. On the other hand, when $3 \times 2^{i-1} \le n < 2^{i+1}$ and $i \ge n$ 2, $$d(n) = d(3) + 2(i - 1) = 2i = 2 \lfloor \lg n \rfloor$$ . Q.E.D. Note that when n is a power of two, the depth of L(n) equals the depth of the n-input Brent-Kung circuit, BK(n), mentioned earlier. If we generalize BK(n) to allow n not to be a power of two as presented in [13], then the derivation of d(n) in Theorem 2 also holds for the depth of BK(n). That is, the depth of L(n) equals the depth of BK(n) for any $n \ge 2$ ; however, the size of L(n) is smaller than that of BK(n) for $n \ge 4$ , which makes L(n) optimal and BK(n) non-optimal. # 3. Optimal prefix circuits with greater depth This section solves an open problem about the existence of a class of optimal prefix circuits with bounded fan-out [18]. These circuits are constructed by using L(n) and a property of prefix circuits combined from two smaller circuits described below. Given two prefix circuits $G_1(n)$ and $G_2(m)$ with nand m inputs, respectively, we can combine $G_1(n)$ and $G_2(m)$ , as depicted in Fig. 6, to obtain a prefix circuit $G(n + m - 1) = G_1(n) \cdot G_2(m)$ with n + m - 1inputs. If $G_1(n)$ and $G_1(m)$ are optimal, then $G_1(n) \cdot G_2(m)$ is also optimal [18]. Let $d_1$ and $d_2$ be the depths of $G_1(n)$ and $G_1(m)$ , respectively. Clearly, if $G_1(n)$ generates 1:n at the last level, or level $d_1$ , then the depth of $G_1(n) \cdot G_2(m)$ equals $d_1 + d_2$ . This property will be used in the proof of Theorem 3 to derive a class of optimal prefix circuits with fan-out 2 and with greater depths than the depth of L(n). For ease of presentation, we define an n-input prefix circuit to be persistent if it generates 1:n at the last level. As can be seen from Figs. 2, 3, and 4, all L(n) circuits are persistent. Fig. 6. Combination of $G_1(n)$ and $G_2(m)$ . **Theorem 3.** For any $n \ge 2$ and any t in the range Theorem 3. 15. $2 \lfloor \lg n \rfloor - 1 \le t \le n - 1$ , if $2^{i} \le n < 3 \times 2^{i-1}$ and $i \ge 1$ ; $2\lfloor \lg n \rfloor \le t \le n-1$ , if $3 \times 2^{i-1} \le n < 2^{i+1}$ and $i \ge 1$ , there exists an *n*-input optimal, persistent prefix circuit with fan-out 2 and depth t. **Proof:** We shall prove by induction on n. The theorem can be checked for $n \le 5$ by examining the L(2)and L(3) in Fig. 2, the L(4) and L(5) in Fig. 5, and the 5-input serial prefix circuit. Let n > 5 and assume the theorem is true for all n' < n. Consider the following two cases. Case (i) $$3 \times 2^{i-1} \le n < 2^{i+1}$$ and $i \ge 2$ . If $$n = 3 \times 2^{i-1} = 1.5 \times 2^i$$ then $$2^{i} < n - 1 < 3 \times 2^{i-1}$$ ; from the induction hypothesis, there exists an (n-1)input optimal, persistent prefix circuit with fan-out 2 and depth t in the range $$2 \lfloor \lg (n-1) \rfloor - 1 \le t \le (n-1) - 1,$$ i.e., $$2\lfloor \lg n \rfloor - 1 \le t \le n - 2.$$ By combining such a circuit with L(2), we can obtain an n-input optimal, persistent prefix circuit with fan-out 2 and depth t in the range $$2 \lfloor \lg n \rfloor \le t \le n-1$$ . $2 \lfloor \lg n \rfloor \le t \le n-1.$ On the other hand, if $3 \times 2^{i-1} < n < 2^{i+1},$ $$3 \times 2^{i-1} < n < 2^{i+1}$$ then $$3 \times 2^{i-1} \le n-1 < 2^{i+1}$$ ; from the induction hypothesis, there exists an (n-1)input optimal, persistent prefix circuit with fan-out 2 and depth t in the range $$2\lfloor \lg (n-1)\rfloor \le t \le (n-1)-1$$ , i.e., $$2\lfloor \lg n \rfloor \le t \le n-2$$ . By combining such a circuit with L(2), we can obtain an *n*-input optimal, persistent prefix circuit with fan-out 2 and depth t in the range $$2\lfloor \lg n \rfloor + 1 \le t \le n-1.$$ Furthermore, from Theorem 2, the depth of L(n) is $2\lfloor \lg n \rfloor$ when $3 \times 2^{i-1} < n < 2^{i+1}$ . Therefore, there exists an n-input optimal, persistent prefix circuit with fan-out 2 and depth t in the range $$2\lfloor \lg n\rfloor \le t \le n-1.$$ Case (ii) $$2^i \le n < 3 \times 2^{i-1}$$ and $i \ge 3$ . $$n=2^i$$ , then $$3 \times 2^{i-2} < n-1 < 2^i$$ ; from the induction hypothesis, there exists an (n-1)input optimal, persistent prefix circuit with fan-out 2 and depth t in the range $$2\lfloor \lg (n-1)\rfloor \le t \le (n-1)-1$$ , i.e., $$2\lfloor \lg n \rfloor - 2 \le t \le n-2$$ . By combining such a circuit with L(2), we can obtain an n-input optimal, persistent prefix circuit with fan-out 2 and depth t in the range $$2\lfloor \lg n \rfloor - 1 \le t \le n - 1.$$ On the other hand, if $$2^{i} < n < 3 \times 2^{i-1}$$ then $$2^{i} \le n-1 < 3 \times 2^{i-1}$$ ; from the induction hypothesis, there exists an (n-1)input optimal, persistent prefix circuit with fan-out 2 and depth t in the range $$2 \lfloor \lg (n-1) \rfloor - 1 \le t \le (n-1) - 1$$ , i.e., $$2\lfloor \lg n \rfloor - 1 \le t \le n - 2.$$ By combining such a circuit with L(2), we can obtain an n-input optimal, persistent prefix circuit with fan-out 2 and depth t in the range $$2\lfloor \lg n \rfloor \le t \le n-1.$$ Furthermore, from Theorem 2, the depth of L(n) is $2 \lfloor \lg n \rfloor - 1$ when $2^i < n < 3 \times 2^{i-1}$ . Therefore, there exists an n-input optimal, persistent prefix circuit with fan-out 2 and depth t in the range $$2 \lfloor \lg n \rfloor - 1 \le t \le n - 1.$$ Q.E.D. ## 4. Optimal algorithms on the send-orreceive model In this section, we show how an optimal prefix circuit with fan-out 2 can be mapped to an optimal prefix algorithm running on the send-or-receive model. Every operation node in Figs. 2-5 that computes i:j on line j by performing (i:k) o (k+1:j) can be interpreted as a communication operation and a computation on the sendor-receive model: transferring i:k from PE k to PE j, which contains k+1:j before the transfer, and performing (i:k) o (k+1:j) by PE j, where $1 \le i \le k < j \le n$ . Thus, each of the optimal prefix circuits with fan-out 2 in this paper corresponds to a prefix algorithm running on the send-or-receive model. It should be noted that the fan-out 2 of an operation node or duplication node on line i of a prefix circuit insures that PE i of the send-or-receive model can send out at most a message in a step. Further, as an operation node on line i of a prefix circuit always has fan-in 2, it means that the corresponding PE i of the send-or-receive model can receive at most a message in a step. The depth of a prefix circuit corresponds to the number of parallel communication steps needed on the send-orreceive model, while the size of a prefix circuit corresponds to the total number of messages transferred. Carrying over the inequality $d(n) + s(n) \ge 2n - 2$ from the circuit model to the send-or-receive model, the sum of the number of communication steps and the number of messages is greater than or equal to 2n - 2 [18]; the prefix algorithms running on the send-or-receive model are optimal because the sum of the number of communication steps and the number of messages equals 2n - 2. ### 5. Conclusion We have presented a systematic approach to constructing optimal parallel prefix circuits with fan-out 2. We have also solved an open problem by building optimal n-input prefix circuits with fan-out 2 and depth t in the range $2 \lfloor \lg n \rfloor - 1 \leq t \leq n - 1$ , if $2^i \leq n < 3 \times 2^{i-1}$ and $i \geq 1$ ; or depth t in the range $2 \lfloor \lg n \rfloor \leq t \leq n - 1$ , if $3 \times 2^{i-1} \leq n < 2^{i+1}$ and $i \geq 1$ . Further, it has been shown that an optimal prefix circuit with fanout 2 corresponds to an optimal prefix algorithm running on the send-or-receive fully connected multicomputer. ### Acknowledgment This research was supported in part by the National Science Council of the R.O.C. under contracts NSC85-2213-E-011-017 and NSC86-2213-E-011-025. The author thanks C.T. Howard Ho for insightful discussion on the prefix computation that helped produce new results. ### References - [1] T. Agerwala, et al., "SP2 system architecture," IBM Syst. J., vol. 34, pp. 152-184, 1995. - [2] S.G. Akl, The Design and Analysis of Parallel Algorithms. Englewood Cliffs, NJ: Prentice-Hall, 1989. - [3] R.P. Brent and H.T. Kung, "A regular layout for parallel adders," *IEEE Trans. Comput.*, vol. C-31, pp. 260-264, Mar. 1982. - [4] J. Bruck, et al., Efficient algorithms for the index operation in message-passing systems, Research Report, RJ 9300, IBM, Apr. 1993. - [5] D.A. Carlson and B. Sugla, "Limited width parallel prefix circuits," *J. Supercomput.*, vol. 4, pp. 107-129, June 1990. - [6] R. Cole and U. Vishkin, "Faster optimal parallel prefix sums and list ranking," *Infom. Contr.*, vol. 81, pp. 334-352, 1989. - [7] T.H. Cormen, C.E. Leiserson, and R.L. Rivest, *Introduction to Algorithms*. Cambridge, MA: MIT Press, 1990. - [8] F.E. Fich, "New bounds for parallel prefix circuits," in *Proc. 15th Symp. on the Theory of Computing*, 1983, pp. 100-109. - [9] The Transputer Databook, 3rd ed. Almondsbury, Bristol, UK: Inmos, 1992. - [10] D.W. Krumme, G. Cybenko, and K.N. Venkataraman, "Gossiping in minimal time," SIAM J. Comput., vol. 21, pp. 111-139, Feb. 1992. - [11] C.P. Kruskal, T. Madej, and L. Rudolph, "Parallel prefix on fully connected direct connection machines," in *Proc. Int. Conf. on Parallel Processing*, 1986, pp. 278-284. - [12] R.E. Ladner and M.J. Fischer, "Parallel prefix computation," *J. ACM*, vol. 27, pp. 831-838, Oct. 1980. - [13] S. Lakshmivarahan and S.K. Dhall, *Parallel Computing Using the Prefix Problem*. Oxford, UK: Oxford University Press, 1994. - [14] S. Lakshmivarahan, C.M. Yang, and S.K. Dhall, "On a new class of optimal parallel prefix circuits with (size + depth) = 2n 2 and \[ \log n \] \] \( \leq \text{depth} \) \( \leq \left \] \( \log \text{n} \] \( \log \text{n} \] \( \log \text{n} \)," in \( \text{Proc. 1987 Int. Conf. on Parallel Processing, 1987, pp. 58-65.} \) - [15] nCUBE 2 Supercomputers: Systems. Beaverton, Oregon: nCUBE, 1990. - [16] A. Nicolau and H. Wang, "Optimal schedule for parallel prefix computation with bounded resources," in *Proc. Third ACM SIGPLAN Symp. on Principles & Practice Parallel Programming*, 1991, pp. 1-10. - [17] M.J. Quinn, Parallel Computing: Theory and Practice. New York: McGraw-Hill, 1994. - [18] M. Snir, "Depth-size trade-offs for parallel prefix computation," *J. Algorithms*, vol. 7, pp. 185-201, 1986. - [19] C.B. Stunkel, et al., "The SP2 High-Performance Switch," *IBM Syst. J.*, vol. 34, pp. 185-204, 1995. - [20] Connection Machine CM-5 Technical Summary: Thinking Machines Corporation, 1991. - [21] N.H.E. Weste and K. Eshraghian, Principles of CMOS VLSI Design: A System Perspective, 2nd ed. Reading, MA: Addison-Wesley, 1993. - [22] Y.C. Lin and C.M. Lin, "Efficient parallel prefix algorithms on fully connected message-passing computers," to appear in *Proc. 3rd Int. Conf. on High Performance Computing*, Dec. 1996. - [23] Y.C. Lin and C.M. Lin, "A family of efficient algorithms for the prefix operation on message-passing computers," in L.N.C.S. #1067, Proc. Int. Conf. and Exhibition on High-Performance Computing and Networking, 1996, pp. 1003-1004.