# New Modular Construction of Low-Complexity bit-parallel Systolic Multipliers for a class of Finite Fields GF( $\mathbf{2}^{\mathbf{m}}$ ) 

Chiou-Yng Lee and Erl-Huei Lu<br>Chung Gung University, Taiwan, R.O.C.<br>Lchiou@ms.chttl.com.tw


#### Abstract

An efficient design for low-complexity and fast computation for the bit-parallel systolic architecture is of practical concern in many digital circuit designs. This paper presents a class of novel bit-parallel systolic multiplier over the finite field $\operatorname{GF}\left(2^{m}\right)$, which is generated from the irreducible all one polynomial (AOP) and equally spaced polynomial (ESP). The proposed architectures have properties of highly regularity, simplicity, and shorter latency, which are important in designing the bit-parallel systolic multipliers. Moreover, the AOP-based systolic multipliers of small fields can be used to construct all the corresponding ESP-based systolic multipliers of large fields. The latency of the AOP-based and ESP-based systolic multipliers require $\mathrm{m}+2$ and $\mathrm{m}+\mathrm{r}+1$ clock cycles, respectively, which ae better than others. The size complexity of the proposed multipliers is smaller than previously developed multipliers of the same class. And as for the parallel systolic multipliers, the bit-parallel structures used in this paper has shorter the computation latency


## I. Introduction

Efficient algorithm of real-time system, high-speed and low-complexity of fast computation over finite field $\operatorname{GF}\left(2^{\mathrm{m}}\right)$ is an extremely important research topic owing to their applications in the areas of computers and communications, e.g., error-control-correcting [10],[14] and cryptography [9],[12],[13]. Significant arithmetic operations for these applications are addition, multiplication, inversion/division. However, multiplication and inversion/division which is proposed by successive multiplication are still complex circuits. Therefore, it is important to introduce an efficient multiplication algorithm for constructing a bit-parallel multiplier of low-complexity for arithmetic circuits. Thus, the bit-parallel systolic architecture is of course the hot topic for us to pursuit.

It is important that the Massey-Omura multiplier (MOM) in [17] is the first modular parallel architectures, which requires the circuit complexity of $\mathrm{O}\left(\mathrm{m}^{3}\right)$ AND gates and $\mathrm{O}\left(\mathrm{m}^{3}\right)$ XOR gates. To reduce the time and size complexities, Itoh and Tsujii in 1989 [6], based on special classes of finite fields such as all one polynomial (AOP) and equally spaced polynomial (ESP), proposed the bit-parallel multipliers. If the irreducible polynomial is an AOP, then only $2 \mathrm{~m}^{2}-2 \mathrm{~m}$ XOR and $\mathrm{m}^{2}$ AND gates are required for the parallel multiplier. Their structure is a modular architecture and has a lower size complexity compared to MOM. Besides, they also extend their multiplication algorithm to the irreducible ESP' s. Later,

Hasan (1992) [5],[4] used the AOP-based multipliers of small size to construct the ESP-based multipliers of large size. Recently, Koc and Sunor (1998) [3] designed multipliers of the low-complexity bit-parallel with canonical basis and normal basis. And meanwhile, Wu and Hasan (1998) [7],[8] presented another low-complexity parallel multipliers employing the weekly dual basis (WDB). Moreover, from the complexity point of view, Drelot (1998) [16] confirmed that irreducible AOP and ESP have smaller complexity arithmetic circuits. The two polynomials based on an isomorphism can be transformed from $\operatorname{GF}\left(2^{\mathrm{m}}\right)$ into the residue polynomial ring modulo $\mathrm{x}^{\mathrm{n}}+1$. If the polynomial is irreducible AOP of degree $m$, then $n=m+1$. The design mentioned above were at the design of modular architectures, however, and their circuits can not be realized to use the systolic architecture.

To optimize finite-field arithmetic circuit design three criteria have to be considered: 1) short computation delay (latency); 2) less circuit comple xity; 3) short clock period (cyclic time). The latency of systolic circuit is defined as the time it takes for an element from the input of a stage to its output. As low-complexity and high-speed computation becomes increasingly attractive, the systolic architectures in the VLSI are a common good choice. Due to the architectures possess concurrent, simple and regular designs that are balanced with I/O. Recently, numerous several of hardwares and algorithms, based on serial and parallel manners, have been proposed for comp uting arithmetic operations in $\operatorname{GF}\left(2^{\mathrm{m}}\right)$, which can be implemented in the systolic architectures [1-2],[15],[18]. The systolic multipliers by bit-serial manners have been introduced [18]; furthermore, the parallel-in-parallel-out systolic multipliers have been proposed in [1-2]. In 1984, Yeh [2] produced the parallel systolic multiplier. Its basic cell contains two AND gates, one 3-input XOR gates, and seven latches. Wei(1994) [1] also produced a power-sum systolic multiplier for computing $\mathrm{AB}^{2}+\mathrm{C}$, where $\mathrm{A}, \mathrm{B}$, and C are any element in $\mathrm{GF}\left(2^{\mathrm{m}}\right)$. However, the latency of the exited parallel systolic multipliers still required 3 m clock cycles.

For the need of low-complexity circuit with minimized latency, this work presents a new bit-parallel systolic architectures to compute the element multiplication over $\operatorname{GF}\left(2^{\mathrm{m}}\right)$. The new circuit is an alternative design in canonical basis over the field $\operatorname{GF}\left(2^{m}\right)$ generated by irreducible AOP and ESP. The novel AOP-based systolic multiplier applies the proposed multiplication schemes to construct a low-complexity and fast computation with the bit-parallel architectures. The designed multipliers are more efficient for the element multiplication in $\operatorname{GF}\left(2^{m}\right)$, as they simplify the architecture and increase computation speed. In addition, applying the AOP-based systolic multiplier of small
fields can construct the ESP-based modular systolic multiplier of large fields. The latency complexity of the developed AOP-based systolic multiplier is more efficient in reducing the clock cycles from 3 m to $\mathrm{m}+2$.

## II. Proposed AOP-based modular systolic multiplier

It is assumed that the reader is familiar with the basic concepts of finite field. The properties of finite fields $\mathrm{GF}\left(2^{\mathrm{m}}\right)$ are covered in detail in [11].
Definition 1 [6]: A polynomial $\mathrm{p}(\mathrm{x})=\sum_{i=0}^{m} p_{i} x^{i}$ over $\mathrm{GF}(2)$ of degree m is called all one polynomial (AOP) iff $p_{i}=1,0 \leq i \leq m$.

An AOP has an important property of $\mathrm{p}(\mathrm{x}) \mid \mathrm{x}^{\mathrm{m}+1}+1$. This variety of polynomial is an irreducible iff $m+1$ is a prime and 2 is a primitive modulo $m+1$. For example, the possible AOP of degree m to become irreducible are specified by irreducible polynomials, such as $\mathrm{m}=2,4,10$, $12,18,28,36,52,58,60,66,82,100$, for $\mathrm{m} \leq 100$. If $\alpha$ is a root of the irreducible $\mathrm{AOP} \mathrm{p}(\mathrm{x})$, then we obtain

$$
\begin{equation*}
\alpha^{\mathrm{m}+1+\mathrm{j}}=\alpha^{\mathrm{j}} \quad(0 \leq \mathrm{j} \leq \mathrm{m}-2 \quad) \tag{1}
\end{equation*}
$$

In order to reduce the modulo operations, the field elements are transformed from $\operatorname{GF}\left(2^{m}\right)$ into the polynomial ring modulo $\mathrm{x}^{\mathrm{m}+1}+1$ because of $\alpha^{m+1}=1$, that is, any element $A=\sum_{i=0}^{m-1} \bar{a}_{i} \alpha^{i} \in \mathrm{GF}\left(2^{\mathrm{m}}\right)$ can also be represented as $A=\sum_{i=0}^{m} a_{i} \alpha^{i}$, where $\bar{a}_{i}=a_{i}+a_{m}$ ( $0 \leq i \leq m-1$ ) [6]. For example, $A=1+\alpha+\alpha^{3} \in G F\left(2^{4}\right)$, the element can be represented as $A=1+\alpha+\alpha^{3}$ by using the canonical representation or $A=\alpha^{2}+\alpha^{4}$ by using the extended representation.

Now, let use consider of the case two extended elements $A=\sum_{i=0}^{m} a_{i} \alpha^{i}$ and $B=\sum_{i=0}^{m} b_{i} \alpha^{i}$ over $\operatorname{GF}\left(2^{\mathrm{m}}\right)$, it is observably that he multiplication of two elements A and B equals to $A B\left(\bmod \alpha^{m+1}+1\right)$. In the following subsection, this type of element representation will be used to develop the multiplication algorithm for designing bit-parallel systolic multipliers.

## A. Algorithm

Since $m+1$ is a prime and 2 is a primitive modulo $m+1$, we obtain $2^{m-1}=(m+2) / 2 \bmod (m+1)$. So $j 22^{m-1} \bmod$ $(\mathrm{m}+1)$ is a permutation $\pi$ on $\{0,1,2, \ldots, \mathrm{~m}\}$, i.e.,

$$
\begin{align*}
\pi(j) & =j \cdot 2^{m-1} \quad \bmod (m+1)  \tag{2}\\
& =j(m+2) / 2 \quad \bmod (m+1)
\end{align*}
$$

According to (2), we immediately obtain the following properties.
Property 1: $2 \pi(j)=j$

Property 2: $\pi(i \pm j)=\pi(i)+\pi(j)$
Property 3: $\pi(m+1)=0$
Applying the Property $1-3$, the element A may be re-expressed by shuffling its terms as follows

$$
\begin{equation*}
A=\sum_{i=0}^{m} a_{\pi(i)} \alpha^{\pi(i)} \tag{3}
\end{equation*}
$$

Therefore, common multiplication results: in both types of multiplication is the multiply-by- $\alpha^{\pi(1)}$ operation, which can be done by the following rule, i.e., let

$$
\begin{equation*}
A^{(1)}=\sum_{i=0}^{m} a_{<\pi(i)-\pi(1)\rangle} \alpha^{\pi(i)} \tag{4}
\end{equation*}
$$

Then,

$$
\begin{align*}
\mathrm{A}^{\pi(1)} & =\mathrm{a}_{\pi(0)} \alpha^{\pi(0)+\pi(1)}+\mathrm{a}_{\pi(1)} \alpha^{\pi(1)+\pi(1)} \\
& +\ldots+\mathrm{a}_{\pi(\mathrm{m})} \alpha^{\pi(\mathrm{m})+\pi(1)} \\
& =\mathrm{a}_{\pi(\mathrm{m})} \alpha^{\pi(0)}+\mathrm{a}_{\pi(0)} \alpha^{\pi(1)}+\ldots+\mathrm{a}_{\pi(\mathrm{m}-1)} \alpha^{\pi(\mathrm{m})} \\
= & \mathrm{a}_{\langle\pi(0)-\pi(1)\rangle}+\mathrm{a}_{<\pi(1)-\pi(1)\rangle} \alpha^{\pi(1)}+\ldots+\mathrm{a}_{\langle\pi(\mathrm{m})-\pi(1)\rangle} \alpha^{\pi(\mathrm{m})}  \tag{5}\\
= & \mathrm{A}^{(1)}
\end{align*}
$$

where $\langle\mathrm{x}\rangle$ is denoted by x modulo $\mathrm{m}+1$. A straightforward multiply-by- $\alpha^{\pi(1)}$ operation is equivalent to shift-right-by-1-bit operation. From (5), we can define cyclic shift-right-by-j-bit operations, i.e.,

$$
\begin{equation*}
A^{(j)}=\sum_{i=0}^{m} a_{\pi(i-j)} \alpha^{\pi(i)} \tag{6}
\end{equation*}
$$

Similarly, $\mathrm{A}^{(\mathrm{j})}$ is equivalent to cyclically shifting j bit to the left, such as

$$
\begin{equation*}
A^{(-j)}=\sum_{i=0}^{m} a_{\pi(i+j)} \alpha^{\pi^{(i)}} \tag{7}
\end{equation*}
$$

Consider the coefficients of A as they relate to $A^{(j)}$ and $A^{(-j)}$, we therefore obtains

$$
\begin{equation*}
A=A^{(-j)} \alpha^{\pi(j)}=A^{(j)} \alpha^{\pi(-j)} \tag{8}
\end{equation*}
$$

Definition 2: Given $A=\sum_{i=0}^{m} a_{\pi(i)} \alpha^{\pi(i)} \quad$ and $B=\sum_{i=0}^{m} b_{\pi(i)} \alpha^{\pi(i)}$, the inner product of A and B, as denotes $A \Theta B$, can be defined as follows
$A \Theta B=\sum_{i=0}^{m} a_{\pi(i)} b_{\pi(i)} \alpha^{i}$
Definition 3: Let two elements A and B periodically be shifted by j positions to right and left, $A^{(j)}$ and $B^{(-j)}$, respectively. Then, based on Definition 1, the $\mathrm{j}^{\text {th }}$ inner product, $A^{(j)} \Theta B^{(-j)}$, is defined as

$$
\begin{equation*}
A^{(j)} \Theta B^{(-j)}=\sum_{i=0}^{m} a_{\pi(i)-\pi(j)\rangle} b_{\pi(i) \pi \pi(j)>} \alpha^{i} \tag{10}
\end{equation*}
$$

Theorem 1: Given $\quad A=\sum_{i=0}^{m} a_{\pi(i)} \alpha^{\pi(i)} \quad$ and
$B=\sum_{i=0}^{m} b_{\pi(i)} \alpha^{\pi(i)}$, the product of $A$ and $B$ can be represented by the following recursive formula

$$
A B=\sum_{j=0}^{m} A^{(j)} \Theta B^{(-j)}
$$

Proof: Let

$$
\begin{aligned}
& \mathrm{A}=\mathrm{a}_{0}+\mathrm{a}_{1} \alpha+\mathrm{a}_{2} \alpha^{2}+\ldots+\mathrm{a}_{\mathrm{m}} \alpha^{\mathrm{m}} \\
& \mathrm{~B}=\mathrm{b}_{0}+\mathrm{b}_{1} \alpha+\mathrm{b}_{2} \alpha^{2}+\ldots+\mathrm{b}_{\mathrm{m}} \alpha^{\mathrm{m}}
\end{aligned}
$$

Then their circular convolution can be re-expressed by

$$
A B=\sum_{i=0}^{m} \sum_{j=0}^{m} a_{j} b_{<i-j>} \alpha^{i}
$$

From Property 1, we know that $i=2 \pi(i)$, for $\mathrm{i}=0$, $1, \ldots, m$, then

$$
\begin{equation*}
A B=\sum_{i=0}^{m} \sum_{j=0}^{m} a_{j} b_{<2 \pi(i)-j>} \alpha^{2 \pi(i)} \tag{11}
\end{equation*}
$$

Next, choosing j such that $\quad j=\pi(i-j)$, for $\mathrm{j}=0,1, \ldots$, m . Therefore, AB can be re-expressed by

$$
\begin{align*}
A B & =\sum_{i=0}^{m} \sum_{j=0}^{m} a_{\pi(i-j)} b_{\pi(i+j)} \alpha^{2 \pi(i)} \\
& =\sum_{i=0}^{m} \sum_{j=0}^{m} a_{\pi(i-j)} b_{\pi(i+j)} \alpha^{i}  \tag{12}\\
& =\sum_{j=0}^{m} A^{(j)} \Theta \mathbf{B}^{(-j)}
\end{align*}
$$

Example 1: If $m=4$, then we obtains $m+1=5$ is a prime. By applying the Property $1-3$, we obtains $\pi(i)$ for $0 \leq i \leq 4 \quad$, such as $\pi(0)=0 \quad, \quad \pi(1)=2^{3} \equiv 3$, $\pi(2)=2 \cdot 2^{3} \equiv 1, \pi(3)=3 \cdot 2^{3} \equiv 4$, and $\pi(4)=4 \cdot 2^{3} \equiv 1$. Assume that $\left\{1, \alpha, \alpha^{2}, \alpha^{3}, \alpha^{4}\right\}$ is an extended basis of the field $\operatorname{GF}\left(2^{4}\right)$, thus, the basis can be transformed into $\left\{\alpha^{\pi(0)}, \alpha^{\pi(1)}, \alpha^{\pi(2)}, \alpha^{\pi(3)}, \alpha^{\pi(4)}\right\}$. Let $A=a_{\pi(0)} \alpha^{\pi(0)}$ $+a_{\pi(1)} \alpha^{\pi(1)}+a_{\pi(2)} \alpha^{\pi(2)}+a_{\pi(3)} \alpha^{\pi(3)}+a_{\pi(4)} \alpha^{\pi(4)} \quad$ and $B=b_{\pi(0)} \alpha^{\pi(0)} \quad+b_{\pi(1)} \alpha^{\pi(1)}+b_{\pi(2)} \alpha^{\pi(2)}+b_{\pi(3)} \alpha^{\pi(3)}+b_{\pi(4)} \alpha^{\pi(4)}$ be two elements of the field $\operatorname{GF}\left(2^{4}\right)$; and let $C=c_{0}+c_{1} \alpha+c_{2} \alpha^{2}+c_{3} \alpha^{3}+c_{4} \alpha^{4}$ be the product of the multiplication A and B . The product C can then be computed by using Theorem 1 , as

$$
\begin{array}{cccccc} 
& a_{\pi(0)} & a_{\pi(1)} & a_{\pi(2)} & a_{\pi(3)} & a_{\pi(4)} \\
X & b_{\pi(0)} & b_{\pi(1)} & b_{\pi(2)} & b_{\pi(3)} & b_{\pi(4)} \\
\hline A \Theta B= & a_{\pi(0)} b_{\pi(0)} & a_{\pi(1)} b_{\pi(1)} & a_{\pi(2)} b_{\pi(2)} & a_{\pi(3)} b_{\pi(3)} & a_{\pi(4)} b_{\pi(4)} \\
A^{(1)} \Theta B^{(-1)}= & a_{\pi(4)} b_{\pi(1)} & a_{\pi(0)} b_{\pi(2)} & a_{\pi(1)} b_{\pi(3)} & a_{\pi(2)} b_{\pi(4)} & a_{\pi(3)} b_{\pi(0)} \\
A^{(1)} \Theta B^{(-2)}= & a_{\pi(3)} b_{\pi(2)} & a_{\pi(4)} b_{\pi(3)} & a_{\pi(0)} b_{\pi(4)} & a_{\pi(1)} b_{\pi(0)} & a_{\pi(2)} b_{\pi(1)} \\
A^{(2)} \Theta B^{(-3)}= & a_{\pi(2)} b_{\pi(3)} & a_{\pi(3)} b_{\pi(4)} & a_{\pi(4))} b_{\pi(0)} & a_{\pi(0)} b_{\pi(1)} & a_{\pi(1)} b_{\pi(2)} \\
A^{(4)} \Theta B^{(-4)}= & a_{\pi(1)} b_{\pi(4)} & a_{\pi(2))} b_{\pi(0)} & a_{\pi(3)} b_{\pi(1)} & a_{\pi(4))} b_{\pi(2)} & a_{\pi(0)} b_{\pi(3)} \\
\hline C= & c_{0} & c_{1} & c_{2} & c_{3} & c_{4}
\end{array}
$$

As stated above, the multiplication scheme is
focused in the extended element to obtain $\mathrm{AB}=\sum_{j=0}^{m} c_{j} \alpha^{j}$, where $c_{j}=\sum_{i=0}^{m} a_{\pi(j)-\pi(i)\rangle} b_{\sigma(j)+\pi(i)>}(\bmod 2)$. In order to obtain completely multiplication scheme, the proposed multiplication in (12) must be to perform the reduced modulo $p(\alpha)$ operation to obtain the desired multiplication of two elements. Therefore, let $A B=\sum_{j=0}^{m-1} \bar{c}_{j} \alpha^{j}$ be the results of AB , the coefficients $\bar{c}_{j}$ can be obtained using the following relationships

$$
\begin{equation*}
\bar{c}_{j}=c_{j}+c_{m} \quad(\bmod 2) \tag{13}
\end{equation*}
$$

## B. Structure and comparison

We call the circuits which realize (12) and (13) as two operation units: the inner product multiplication (IPM) unit and the final reduced modulo $\mathrm{p}(\alpha)$ (FRM) unit, respectively. According to Theorem 1, it is obvious that the IPM unit of Fig. 3 requires $\mathrm{m}+1$ inner-product step procedures (IPSPs). The structure of each IPSP is shown in Fig. 1(a) includes $\mathrm{m}+1$ basic cells. The basic cell is the realization of $\mathrm{c}_{\mathrm{i}}+\mathrm{a}_{2 \pi(\mathrm{i})-\pi(\mathrm{j})}, \mathrm{b}_{\langle\pi(\mathrm{i})+\pi(\mathrm{j})\rangle} \bmod 2$ which includes one 2-input AND gate, one 2-input XOR gate and three 1-bit latches, as shown in Fig. 1(b). Fig. 2 depicts that the structure of FRM unit is operation unit of (13), which includes m 2 -input XOR gate and m 1-bit latches. Fig. 3 illustrates that based on Fig. 1-2, the proposed AOP-based systolic multiplier over $\operatorname{GF}\left(2^{4}\right)$ is comprised of two parts: the IPM unit and the FRM unit.

In the IPM unit of Fig. 3, the $i^{\text {th }}$ column cells denote the order of $\alpha^{i}$. The $j^{\text {th }}$ row cells is identical to the $j^{\text {th }}$ IPSP for $A^{(i)} \Theta B^{(j)}$ operations. Hereafter, the $i^{\text {th }}$ cell of the $\mathrm{j}^{\text {th }}$ IPSP of IPM unit is denoted by the ( $\mathrm{i}, \mathrm{j}$ ) cell. With coefficients $c_{i}, a_{\pi(i-j)}, b_{\pi(i+j)}$ enter the cell $(\mathbf{i}, \mathrm{j})$, the cell operates $c_{i}=c_{i}+a_{\pi(i-j)} b_{\pi(i+j)} \quad(\bmod 2)$ computations. The basic cells consist of one 2 -input AND gate, one 2 -input XOR gate and three 1-bit latches. When the input data of three elements A, B, C enter the array, all coefficients are distributed over the first row cell. Fig. 3 presents that all coefficients in the $\mathrm{j}^{\text {th }}$ IPSP $(0 \leq j \leq 4)$ are als o distributed over the $\mathrm{j}^{\text {th }}$ row cells. As the operations of the $\mathrm{j}^{\text {th }}$ IPSP, the coefficients $a_{\pi(i-j)}$ > and $a_{\pi(i+j)}$ in the cell $(\mathrm{i}, \mathrm{j})$, for $0 \leq i \leq m$, respectively propagate to the cells $(\mathrm{i}+1, \mathrm{j}+1$ ) and ( $\mathrm{i}+1, \mathrm{j}-1$ ). As previously stated, neighborhood communications among cells is performed by transportation of all neighbor coefficients in the array. This instructs us to take advantage of the bit-parallel systolic architectures for the circuit design with which each IPSP only requires one clock cycle.

In the successive computations, the input data can continuously enters the array, and each IPSP only demands one clock cycle to complete the inner-product operations. From Fig. 3, the proposed AOP-based systolic multiplier comprises two parts: the IPM unit and
the FRM unit. The IPM unit consists of m+1 IPSPs, that is, the latency of the IPM unit requires $m+1$ clock cycles. According to Fig. 2, the FRM unit only demands one clock cycle. Therefore, the latency complexity of the proposed multiplier requires only $\mathrm{m}+2$ clock cycles to compute AB for the first input data that enters the planned systolic multiplier. A possible clock period of latency requires a minimum of one 2-input AND gate and one 2-input XOR gate delays, as shown in Fig. 2(b). The total gate complexity in this circuit comprises $(m+1)^{2} 2$-input AND, $(m+1)^{2}+m \quad 2$-input XOR gates and $3(m+1)^{2}+m$ 1-bit latches. Since the operation works every clock cycle and no cycle is wasted, the proposed architecture yields the maximum possible throughput. Therefore, this architecture is highly regular and simple in structure, and has a shorter latency to perform the element multiplication.

There are several points to be addressed. The latency of the systolic architecture for multiplications over $\operatorname{GF}\left(2^{m}\right)$ is only $m+1$ clock cycles while most other bit-parallel systolic multipliers, such as these in [1] and [2], require 3 m . Table 1 reveals that our AOP-based multipliers require more logic circuit than the two low-complexity design but they are much simple than Wei's and Yeh's multipliers. The propagation delay of each cell is short being the total delay of one 2input AND gate, one 2-input XOR gate and one 1-bit latch, and the multiplier generates a product in each clock cycle. The throughput is therefore very high. Finally, this architecture is highly regular, simple and with very few global connections.

## III. Proposed ESP-based bit-parallel modular systolic multiplier

Definition 4[6]: A polynomial $g(x)=x^{m r}+x^{r(m-1)}+\cdots+$ $x^{r}+1=p\left(x^{r}\right)$ over $\operatorname{GF}(2)$, where $\mathrm{p}(\mathrm{x})$ is an AOP of degree m , is termed r -equally spaced polynomial (r-ESP) of degree mr .

It is well known that if $\mathrm{p}(\mathrm{x})$ is an irreducible AOP of degree m over $\mathrm{GF}(2)$, then $g(x)=p\left(x^{r}\right)$ is irreducible over GF(2) iff $r=(m+1)^{j} \neq 1 \bmod (m+1)^{2}$ for $j \geq 1$. An r-ESP also has an important property of $\alpha^{r(m+1)}=1$, where $\alpha$ is a root of $g(x)$. Now, let us consider the property of $\alpha^{r(m+1)}=1$, for any element $A=\sum_{i=0}^{m r-1} \bar{a}_{i} \alpha^{i} \in \mathrm{GF}\left(2^{\mathrm{mr}}\right)$ can be represented by

$$
\begin{equation*}
A=\sum_{i=0}^{(m+1) r-1} a_{i} \alpha^{i} \tag{14}
\end{equation*}
$$

where $\bar{a}_{i r+j}=a_{i r+j}+a_{m r+j}, 0 \leq j \leq r-1,0 \leq i \leq m-1 \quad[6]$.
Therefore, any element $A \in G F\left(2^{m r}\right)$ can might be defined as

$$
\begin{equation*}
A=\sum_{i=0}^{r-1} A_{i} \alpha^{i} \tag{15}
\end{equation*}
$$

$$
A_{k}=\sum_{i=0}^{m} a_{i r+k} \alpha^{i r}, \quad 0 \leq k \leq r-1
$$

Since $m+1$ is a prime and 2 is a primitive modulo $\mathrm{m}+1$, we obtain $2^{m-1}=(m+2) / 2 \bmod (m+1)$. So jr2 ${ }^{m-1}$ $\bmod (m+1) r$ is a permutation $\sigma$ on $\{0, r, 2 r, \ldots, m r\}$, i.e.,

$$
\begin{align*}
\sigma(j) & =j r 2^{m-1} \bmod (m+1) r  \tag{16}\\
& =j r(m+2) / 2 \bmod (m+1)
\end{align*}
$$

Therefore, the element $A_{k}(0 \leq k \leq r-1)$ can be re-expressed by shuffling its terms as follows

$$
\begin{equation*}
A_{k}=\sum_{i=0}^{m} a_{\sigma(i)+k} \alpha^{\omega(i)} \tag{17}
\end{equation*}
$$

With Property 1-3, hence, we concludes that $2 \sigma(\mathrm{j})=\mathrm{jr}, \quad \sigma(\mathrm{i} \pm \mathrm{j})=\sigma(\mathrm{i}) \pm \sigma(\mathrm{j})$, and $\sigma(\mathrm{m}+1)=0$. For two sub-elements $\quad A_{k}=\sum_{i=0}^{m} a_{\sigma(i)+k} \alpha^{\sigma(i)} \quad$ and $B_{h}=\sum_{i=0}^{m} b_{\sigma(i)+h} \alpha^{\sigma(i)}(0 \leq k, h \leq r-1)$, straightforwardly, the product of $A_{k} B_{h}$ is based on Theorem 1 to obtain the following results

$$
\begin{equation*}
A_{k} B_{h}=\sum_{j=0}^{m} A_{k}^{(j)} \Theta B_{h}^{(-j)} \tag{18}
\end{equation*}
$$

Theorem 2: Given two sub-element $\mathrm{A}_{\mathrm{k}}$ and $\mathrm{B}_{\mathrm{h}}(0 \leq k, h \leq r-1)$, then $\mathrm{A}_{\mathrm{k}} \mathrm{B}_{\mathrm{h}}$ multiplied by $\alpha^{\mathrm{r}}$ is equivalent to $\left\{A_{k} B_{h}\right\}^{(1)}$.
Proof: Since Theorem 2, the results of $\mathrm{A}_{\mathrm{k}} \mathrm{B}_{\mathrm{h}}$ obtain

$$
A_{k} B_{h}=\sum_{i=0}^{m} c_{i} \alpha^{i r}
$$

where

$$
c_{i}=\sum_{j=0}^{m} a_{<\pi(i)-\pi(j)+k>} b_{<\pi(i)+\pi(j)+h>}
$$

Therefore, $\mathrm{A}_{\mathrm{k}} \mathrm{B}_{\mathrm{h}}$ multiplied by $\alpha^{\mathrm{r}+\mathrm{q}}$ obtains

$$
\begin{align*}
\alpha^{r} A_{k} B_{h} & =c_{0} \alpha^{r}+c_{1} \alpha^{2 r}+\ldots+c_{m} \alpha^{m r+r} \\
& =c_{m}+c_{0} \alpha^{r}+c_{1} \alpha^{2 r}+\ldots+c_{m-1} \alpha^{m r} \\
& =\left\{A_{k} B_{h}\right\}^{(1)} \tag{19}
\end{align*}
$$

Finally, assume that two elements $\mathrm{A}=\mathrm{A}_{0}+\mathrm{A}_{1} \alpha+\mathrm{A}_{2} \alpha^{2}$ $+\ldots+\mathrm{A}_{\mathrm{t}-1} \alpha^{\mathrm{r}-1}$ and $\mathrm{B}=\mathrm{B}_{0}+\mathrm{B}_{1} \alpha+\mathrm{B}_{2} \alpha^{2}+\ldots+$ $\mathrm{B}_{\mathrm{r}-1} \alpha^{\mathrm{r}-1} \in \mathrm{GF}\left(2^{\mathrm{mr}}\right)$, then the multiplication of A and B based on Theorem 2 and 3, can be re-expressed as

$$
\begin{align*}
A B & =\sum_{i=0}^{r-1} \sum_{j=0}^{r-1}\left\{\left.\left.A_{\left|(i-j) \frac{r+1}{2}\right|} B\right|^{(i+j) \frac{r+1}{2}}\right|^{\left(w_{i}\right)} \alpha^{i}\right.  \tag{20}\\
& =C_{0}+\alpha C_{1}+\cdots+\alpha^{r-1} C_{r-1}
\end{align*}
$$

where
where

$$
\begin{align*}
C_{i} & =\sum_{j=0}^{r-1}\left(\left\{A_{\left\|(i-j) \frac{r+1}{2}\right\|^{B}} \|^{\left.(i+j) \frac{r+1}{2} \right\rvert\,}\right\}^{\left(w_{i}\right)}\right.  \tag{21}\\
& =c_{i}+c_{i+r} \alpha^{r}+\cdots+c_{i+m r} \alpha^{m r}
\end{align*}
$$

Note that $\|x\|$ denotes x modulo $\mathrm{r} ; \quad \mathrm{w}_{\mathrm{i}}=1$ if
$\left\|(i-j) \frac{r+1}{2}\right\|+\left\|(i+j) \frac{r+1}{2}\right\| \geq r$, else $\mathrm{w}_{\mathrm{j}}=0$. Let $A B=\sum_{i=0}^{r-1} \bar{C}_{i} \alpha^{i}$, where $\bar{C}_{i}=\sum_{j=0}^{m-1} \bar{c}_{i+r j} \alpha^{j r}$, then the coefficients between $\mathrm{C}_{\mathrm{i}}$ and $\bar{C}_{i}$ have the following relations

$$
\begin{equation*}
\bar{c}_{i+j r}=c_{i+j r}+c_{i+m r} \quad(0 \leq \mathrm{j} \leq \mathrm{m}-1,0 \leq \mathrm{i} \leq \mathrm{r}-1) \tag{22}
\end{equation*}
$$

As previously stated, the proposed ESP-based systolic multiplier comprises $r^{2}$ IPM and $r$ FRM units, in which the IPM array is for computing (19); the FRM unit is for (22). As a simple illustration, the bit-parallel systolic multiplier based on 3-ESP $x^{6}+x^{3}+1$ corresponding to the irreducible AOP $x^{2}+x+1$ is shown in Fig. 4. Fig. 3 demo nstrates the details of IPM and FRM circuits. In Fig. 4, the $I P M_{k, h}$ denotes the proposed that two elements $A_{k}$ and $B_{h}$ enter the IPM unit. According to (17) the input elements are shuffled before enter the IPM unit The computed result $C_{k+h(\text { modr })}$ of $I P M_{k, h}$ is to propagate to the $I P M_{k-\frac{r+1}{2}, h+\frac{r+1}{2}}$ unit. The coefficients of $C_{k+h(\text { modr })}$ which is the output of $I P M_{k, h}$ unit must performs a periodic shift-rght-by-1-bit operation if $h+k \geq r$, subjected to the relations of Theorem 2.

Generally, the proposed ESP-based multiplier over $\operatorname{GF}\left(2^{m}\right)$ which has modular systolic architecture requires $(m+r)^{2}$ AND gates, $(m+r)^{2}+m$ XOR gates, $m+r+1$ clock cycles. The proposed ESP-based systolic multiplier of larger fields can be constructed by the corresponding is based on AOP-based systolic multiplier of smaller fields. We therefore ascertain both irreducible AOP and ESP, for example of $m$ and $r, 6(3), 18(9), 20(5)$, 54(27), and 100(25). Table 2 presents a comparison among ESP-based bit-parallel multipliers. It is evident that our proposed ESP-based multiplier is able to design the bit-parallel systolic multiplier with modular architectures.

## IV. Conclusions

This paper examined a novel systolic multiplier over finite field $\operatorname{GF}\left(2^{\mathrm{m}}\right)$, which are generated by an irreducible AOP and ESP. An element representation is based on a field isomorphism from $\operatorname{GF}\left(2^{\mathrm{m}}\right)$ into the residue polynomial ring modulo $\mathrm{x}^{\mathrm{m}+1}+1$ and $\mathrm{x}^{\mathrm{mr+r}}+1$, respectively. All of which are highly regular and able to realize with bit-parallel systolic architectures. The proposed AOP-based bit-parallel systolic multipliers efficiently improve the latency complexity from 3 m to $\mathrm{m}+2$ clock cycles. Moreover, the AOP-based systolic multipliers of smaller fields can be applied to construct all the corresponding ESP-based systolic multipliers of
larger fields. From the hardware implementation pointing of view, the primary contribution of our architectures is only able to construct the bit-parallel systolic architectures.

## References

[1] S. W. Wei, "A systolic power-sum circuit for GF( $2^{m}$ )," IEEE trans. on computers vol. 43, no. 2, pp. 226-229, Feb. 1994.
[2]C. S. Yeh, S. Reed, and T.K. Truong, "Systolic multipliers for finite fields $\operatorname{GF}\left(2^{\mathrm{m}}\right)$," IEEE trans. on computers vol. C-33, pp. 357-360, Apr. 1984.
[3] C. K. Koc and B. Sunar, "Low complexity bit-parallel canonical and normal basis multipliers for a class of finite fields," IEEE trans. on computers vol. 47, no. 3, pp. 353-356, Mar. 1998.
[4] M.A. Hasan, M. Z. Wang, and V.K. Bhargava, "A modified Massey-Omura multiplier for a class of finite fields," IEEE trans. on computers vol. C-42, No.10, pp.1278-1280, Oct. 1992.
[5] M. A. Hasan, M. Z. Wang, and V. K. Bhargava, "Modular construction of bw complexity parallel multipliers for a class of finite fields GF $\left(2^{\mathrm{m}}\right)$," IEEE trans. on computers vol. 41, no. 8, pp. 962-971, Aug. 1992.
[6] T. Itoh and S. Tsujii, "Structure of parallel multipliers for a class of fields GF $\left(2^{m}\right), "$ Info. Comp. Vol. 83, pp. 21-40, 1989.
[7]H. Wu, M. A. Hasan, and L. F. Blake, "New low-complexity bit-parallel finite field multipliers using weakly dual bases," IEEE trans. on computers vol. 47, no. 11, pp. 1223-1234, Nov. 1998.
[8]H. Wu, M. and A. Hasan, "Low-complexity bit-parallel multipliers for a class of finite fields," IEEE trans. on computers vol. 47, no. 8, pp. 883-887, Nov. 1998.
[9] W. Diffe and M. Hellman, "New directions in cryptography," IEEE trans. Inform. theory, vol. IT-22, pp. 644-654, 1976.
[10] Y. R. Shayan, Tho Le-Ngoc, and V.K. Bhargava, "Binary-decision approach to fast chien search for software decoding of BCH codes," IEE proc. Vol. I34, Pt.F, No.6, pp. 629-632, Oct. 1987.
[11] E. R. Berlekamp, Algebraic Coding Theory, revised Laguna Hills, CA: Aegean Park, 1984.
[12] D. E. R. Denning, Cryptography and data security. Reading, MA: Addison-Wesley, 1983.
[13] A. M. Odlyzko, "discrete logarithms in finite fields and their cryptographic significance, " in Adv. Cryptol., proc. Eurocrypt '84, Paris, France, pp. 224-314, Apr. 1984.
[14] M. A. Hasan and V. K. Bhargava, "Architecture for a low complexity rate-adaptive Reed-Solomon encoder, " IEEE trans. on computers, vol. 44, no. 7, pp. 938-942, Jul. 1995.
[15] J. J. Wonziak, "Systolic dual basis serial multiplier, " IEE Proc.-Comput. Digit. Tech. Vol. 145, No. 3, pp. 237-241, May 1998.
[16] G. Drolet, "A new representation of elements of finite fields $\operatorname{GF}\left(2^{\mathrm{m}}\right)$ yielding small complexity arithmetic," IEEE trans. on computers, vol. 47, no. 9, pp. 938-946, Sep. 1998.
[17] J. Omura and J. Massey, "Computational method
and apparatus for finite field arithmetic," U.S. Patent Number 4587627, May 1986.
[18] M. Diab and A. Poli, "New bit-serial systolic multiplier for $\mathrm{GF}\left(2^{\mathrm{m}}\right)$ using irreducible trinomials," Electron. Letters, vol.27, No. 13, pp. 1884-1885, Jul. 1991.

Table 1: Comparison of the related parallel systolic multipliers

| Multiplier | Wei [1] | Yeh [2] | Proposed <br> IOPM (Fig. 3) |
| :---: | :---: | :---: | :---: |
| Architecture | systolic | Systolic | Bit-parallel <br> systolic |
| Function | $\mathrm{AB}^{2}+\mathrm{C}$ | AB | $\mathrm{AB}+\mathrm{C}$ |
| The total of gates <br> Complexity <br> \# 2-input AND <br> \# 2-input XOR <br> \# 3-input XOR <br> \# 1-bit latches | $3 \mathrm{~m}^{2}$ <br> $\mathrm{~m}^{2}$ | $\mathrm{m}^{2}$ <br> $10 \mathrm{~m}^{2}$ | $2 \mathrm{~m}^{2}$ <br> 0 <br> $7 \mathrm{~m}^{2}$ |
| Computation <br> time <br> per cell | $(\mathrm{m}+1)^{2}$ <br> $(\mathrm{~m}+1)^{2}+\mathrm{m}$ <br> 0 <br> $\mathrm{~T}_{\mathrm{AND}}+\mathrm{T}_{3 \mathrm{XO}}$ <br> R | $\mathrm{T}_{\mathrm{AND}}+\mathrm{T}_{\mathrm{XOR}}$ | $\mathrm{T}_{\mathrm{AND}}+\mathrm{T}_{\mathrm{XOR}}$ |
| latency | 3 m | 3 m | $\mathrm{~m}+2$ |

Fig. 1. The circuit of the $\mathrm{j}^{\text {th }}$ inner product step procedures (IPSP)

Table 2: Comparison of parallel multipliers of $\operatorname{GF}\left(2^{\mathrm{m}}\right)$ generated by an irreducible r-ESP of degree m

| multipliers | architecture | basis | function | \#AND | \#XOR | Cycle time |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ITM[6] | modular | polynomi <br> al | AB | $(\mathrm{m}+\mathrm{r})^{2}$ | $(\mathrm{~m}+\mathrm{r})^{2}-\mathrm{r}$ | $\mathrm{T}_{\mathrm{A}}+\left(\left\lceil\log _{2} \mathrm{~m}\right\rceil+\left\lceil\log _{2}(\mathrm{~m}-\mathrm{r}+1)\right\rceil\right) \mathrm{T}_{\mathrm{X}}$ |
| HWBM[5] | modular | polynomi <br> al | AB | $\mathrm{m}^{2}$ | $\mathrm{~m}^{2}+\mathrm{m}-2 \mathrm{r}$ | $\mathrm{T}_{\mathrm{A}}+\left(\mathrm{m} / \mathrm{r}+\left\lceil\log _{2} \mathrm{~m}\right\rceil\right) \mathrm{T}_{\mathrm{X}}$ |
| WDBM[7] | modular | weakly <br> dual | AB | $\mathrm{m}^{2}$ | $\mathrm{~m}^{2}-\mathrm{r}$ | $T_{A}+\left\{\left\lceil\log _{2} \frac{m}{r}\right\rceil+\left[\log _{2}\left(r+\left[\frac{m-r}{2\left\lceil\log _{2} \frac{m}{r}\right.}\right\rceil\right]\right)\right] T_{x}$ |
| Presented <br> ESPM <br> (Fig. 4) | Modular <br> systolic | polynomi <br> al | $\mathrm{AB}+\mathrm{C}$ | $(\mathrm{m}+\mathrm{r})^{2}$ | $(\mathrm{~m}+\mathrm{r})^{2}+\mathrm{m}$ | $\mathrm{T}_{\mathrm{A}}+\mathrm{T}_{\mathrm{X}}$ |



Fig. 2. The final reduced modulo $p(\alpha)$ (FRM) unit


Fig. 4. The configuration of ESP-based systolic multiplier over GF $\left(2^{6}\right)$


Fig, 3. The bit-parallel systolic multiplier over $\mathrm{GF}\left(2^{4}\right)$ based on an irreducible AOP

