Multiobjective VLSI Cell Placement Using Distributed Simulated Evolution Algorithm


King Fahd University of Petroleum & Minerals
Computer Engineering Department
Dhahran 31261, Saudi Arabia
E-mail: \{sadiq,mustafa,alizaidi\}@ccse.kfupm.edu.sa

Abstract—Simulated Evolution (SimE) is a sound stochastic approximation algorithm based on the principles of adaptation. If properly engineered it is possible for SimE to reach near-optimal solutions in lesser time than Simulated Annealing [1], [2]. Nevertheless, depending on the size of the problem, it may have large run-time requirements. One practical approach to speed up the execution of SimE algorithm is to parallelize it. This is all the more true for multi-objective cell placement, where the need to optimize conflicting objectives (interconnect wirelength, power dissipation, and timing performance) adds another level of difficulty [3]. In this paper a distributed parallel SimE algorithm is presented for multiobjective VLSI standard cell placement. Fuzzy logic is used to integrate the costs of these objectives. The algorithm presented is based on random distribution of rows to individual processors in order to partition the problem and distribute computationally intensive tasks, while also efficiently traversing the complex search space. A series of experiments are performed on ISCAS-85/89 benchmarks to compare speedup with serial implementation and other earlier proposals. Discussion on comparison with parallel implementations of other iterative heuristics is included.

I. INTRODUCTION

Simulated Evolution algorithm (SimE) is a general search strategy for solving a variety of combinatorial optimization problems [2]. It operates on a single solution, termed as population. Each population consists of elements. In case of the placement problem, these elements are cells to be moved. The algorithm has one main loop consisting of three basic steps, Evaluation, Selection and Allocation.

In the Evaluation step, goodness of each element is measured as a single number between '0' and '1', which is an indicator of how near the element is from its optimal location. Then comes Selection, which is the process of selecting elements which are unfit (badly placed) in the current solution. An individual having high goodness measure still has a non-zero probability of being selected. It is this element of non-determinism that gives SimE the capability of escaping local minima. The last step, Allocation, has the most impact on the quality of solution. Its main function is to mutate the population by altering the location of selected cells.

The above three steps are executed in sequence until no noticeable improvement to the population goodness is observed after a number of iterations, or a fixed number of iterations are completed.

The pseudo-code of SimE is similar to that given in Figure 1 [1]. Although the illustration depicts the slave process to be discussed later, if the entire set of rows is allocated to a single processor, then the execution of the algorithm is the same as that of the serial SimE.

For large test cases, SimE has large runtime requirements. The reason is that, like other stochastic iterative algorithms, SimE is blind. It has to be told when to stop. Depending on which stopping criteria are used, as well as the size of the problem, SimE may consume hours of CPU time before it stops. The most practical approach to speed up the execution of SimE algorithm is to parallelize it. Unlike Simulated Annealing [4], [5] Genetic Algorithms [6] and Tabu Search [7] the parallelization of SimE has not been the subject of much research. Kling and Banerjee suggested three ways of speeding up the SimE algorithm [2], [8].

A parallelization strategy for VLSI cell placement for a single objective (wirelength) was attempted on a network of workstations [2], where each station is assigned a number of rows of the placement problem, in a pre-determined order. The stations executes one iteration of the SimE algorithm on the cells of the rows assigned to it. In each iteration, the rows are redistributed among the processors in a predetermined order [2].

In this paper we are addressing the problem of parallelizing SimE to solve the multiobjective VLSI standard cell placement by using a cluster of low cost PCs. The goal is to achieve a placement quality very near or equal to that achieved by serial algorithm, but with run times that decrease linearly (or super-linearly) with increasing number of processors.

In the next section we present the details of our NP-hard, multiobjective, VLSI cell placement problem. Problem formulation and models for estimating the costs for the various objectives to be optimized are presented. In Section III, the distributed algorithm is detailed. Experimental setup, results obtained on ISCAS benchmark circuits, and other observations are given in Section IV, followed by Conclusion in Section V.

II. MULTIOBJECTIVE FUZZY COST FUNCTION

In this section, we formulate our multiobjective fuzzy cost function used in the optimization process.
The objectives considered in our problem include: optimizing power consumption, improving timing performance (delay), and reducing overall wirelength, while considering layout width as a constraint. A semi-formal description of the placement problem can be found in [3]. The multiobjective cost function is similar to the one formulated in [9]. The first objective, wirelength cost ($Cost_{wire}$) is estimated using an approximate Steiner tree algorithm.

The power consumption cost $p_i$ is computed for each net $i$. Assuming a a fix supply voltage and clock frequency, the estimate can be obtained by $p_i \approx C_i \cdot S_i$, (where $S_i$ is the switching probability and $C_i$ the total capacitance, of net $i$). This can be further improved to $p_i \approx l_i \cdot S_i$ (since interconnect capacitances are a function of the interconnect lengths, and the input capacitances of the gates are constant). The total estimate of the power dissipation reduces to $Cost_{power} = \sum_{i \in M} p_i = \sum_{i \in M} (l_i \cdot S_i)$.

The delay cost is taken as the delay along the longest path in a circuit. The delay $T_\pi$ of a path $\pi$ consisting of nets $\{v_1, v_2, ..., v_k\}$, is expressed as: $T_\pi = \sum_{i=1}^{k} (CD_i + ID_i)$ where $CD_i$ is the switching delay of the cell driving net $v_i$ and $ID_i$ is the interconnect delay of net $v_i$. The placement phase affects $ID_i$ because $CD_i$ is technology dependent parameter and is independent of placement; $Cost_{delay} = \max\{T_\pi\}$.

The layout width is constrained not to exceed a certain positive ratio $\alpha$ to the average row width $w_{avg}$.

Since we are optimizing three objectives simultaneously, we need to have a cost function that represents the effect of all three objectives in the form of a single quantity. We use fuzzy logic to integrate these multiple, possibly conflicting objectives into a scalar cost function. Fuzzy logic allows us to describe the objectives in terms of linguistic variables. Then, fuzzy rules are used to find the overall cost of a placement solution. In this work, we have used following fuzzy rule:

**IF** a solution has **SMALL wirelength** AND **LOW power consumption** AND **SHORT delay** **THEN** it is an **GOOD** solution.

The above rule is translated to and-like OWA fuzzy operator [10] and the membership $\mu(x)$ of a solution $x$ in fuzzy set **GOOD solution** is obtained by:

$$\mu(x) = \left\{ \begin{array}{ll}
\beta \cdot \min_{j=p,d,l} \{\mu_j(x)\} + (1 - \beta) \cdot \frac{1}{3} \sum_{j=p,d,l} \mu_j(x); & \text{if } Width - w_{avg} \leq \alpha \cdot w_{avg}, \\
0; & \text{otherwise.}
\end{array} \right. $$

Here $\mu_j(x)$ for $j = p,d,l$, width are the membership values in the fuzzy sets **LOW power consumption**, **SHORT delay**, and **SMALL wirelength** respectively, $\beta$ is the constant in the range $[0,1]$. The solution that results in maximum value of $\mu(x)$ is reported as the best solution found by the search heuristic. The membership functions for fuzzy sets **LOW power consumption**, **SHORT delay**, and **SMALL wirelength** and the lower bounds for different objectives can be found in [9].

### III. DISTRIBUTED SIMULATED EVOLUTION ALGORITHM

The parallelization of the SimE algorithm is carried out by partitioning the workload among available processors. The partitioning is done according to rows. The workload for each slave in the cell placement problem is the computation of SimE operations of Evaluation, Selection, and Allocation on it’s assigned rows [2].

The row allocation pattern that was proposed in [2] is made up of two alternating sets of rows. In the even iterations, each
slave gets a slice of $\left\lceil \frac{K}{m} \right\rceil$ rows, (where $m$ is the number of slaves, and $K$ is the total number of rows in the placement) while in the odd iterations the $j^{th}$ slave gets the set of rows $j$, $j + m$, $j + 2m$, and so on. It has been shown that with the above fixed pattern of assigning rows to slaves in alternate steps, each cell can move to any position on the grid in at most two steps [2]. The consequence of row partitioning however is that the each processors has only a partial view of the placement. This hinders free cell movement, making it more difficult for cells to reach their optimal locations. Results from implementing this strategy on our multiobjective optimization problem revealed that even when given a large amount of time, the best solution obtained was poorer than one achieved by the serial implementation.

Though the lack of a global placement view will always exist in case of a distributed algorithm, the effects of restrictive cell movement can be alleviated by using a better row allocation pattern. Use of a pattern that facilitates a variety of combination among the rows sounds intuitively better. This lead us to experiment with a random row allocation.

The pseudo code of the parallel simulated evolution is illustrated in Figures 1 and 2. As can be seen, one of the processors (the master) is in-charge of running SimE on a particular partition as well as performing the following tasks periodically at the end of each iteration: (1) receive the partial placements from all other processors and combine them into a new solution and evaluate its fitness, (2) partition the new solution to obtain a new row allocation, and finally, (3) distribute the resulting sub-populations among the processors. The number of rows randomly assigned depends on the size of the placement and the number of processors. This is repeated for all iterations until the termination condition is met.

IV. RESULTS AND DISCUSSION

The parallel SimE strategies mentioned were implemented in C/C++ using MPICH Message Passing Interface implementation ver 1.2.4. for communication between nodes. The experimental environment used consists of a dedicated cluster of 8 Pentium IV 2 GHz PCs with 256 MB RAM, running RedHat Linux ver 7.3 connected with a fast Ethernet switch. ISCAS-89 circuits are used as performance benchmarks for evaluating the parallel SimE placement techniques. These circuits are of various sizes in terms of number of cells and paths, and thus offer a variety of test cases.

Table I shows the amount of time taken to reach a predefined fitness objective with increasing number of processors for both the proposed random row allocation strategy, and the fixed row allocation strategy. For the proposed strategy, as can be seen, there is a constant decrease in runtime for all circuits. Better trends are observed for medium to large circuits, than for smaller ones, as can be seen in Figure 3(a). Speedup is also illustrated in the bar-chart given in Figure 3(b). Due to space restrictions, and scaling factor limitations, not all results have been included in the same figure for sake of clarity.

The fitness values achieved with the proposed row allocation are consistently higher in all test cases when compared to the fixed row allocation scheme, as shown by the *Qual Fixed* column in Table I, the fixed row allocation never equals 100% of the solution quality obtained by the proposed scheme. Further, the run times are far better, and the speedup is super linear in most cases. This can be attributed to modified working space of the selection and allocation operators on each slave, as in each iteration different sets and combination of rows are addressed. This has resulted in even more reduced times to obtain desired solution quality than with workload partitioning alone.

A. Comparison With Other Iterative Heuristics

The runtimes and solution quality was also compared with those obtained from parallelizing simulated annealing [4], genetic algorithms (a distributed search space parallel strategy) [6], and tabu search [7]. For GAs, the time for completion to obtain solutions of a certain pre-specified quality were exorbitantly high. And in some cases, for the given run-time, acceptable solutions could not be obtained. For example, for the S1494, the serial GA implementation took 1883 seconds, and when the parallel version was executed on 7 processors the best time was 418 seconds (with 8% inferior quality than
processors. This speedup trend was compared to other established methods with increasing number of processors.

While better quality was obtained in some cases at the cost of high computation time, for the same quality the run-time requirements for TS were over three times more than that required by parallel SimE. For example, for s1494, the time taken by serial TS was 268 seconds, and when parallel TS was run on 6 processors, the runtime was 57 Seconds (compared to 5 Seconds by SimE) with slightly better quality, and TS took over 15 Seconds to obtain solutions of same quality as SimE. A similar trend was seen for all circuits.

For simulated annealing, the asynchronous multiple-Markov chain parallelization strategy was chosen [4]. Like TS, Parallel SA was also able to achieve slightly better quality solutions than SimE, given enough time. However, for a fixed quality, SimE was seen to be increasingly faster than SA as processors were increased. For instance, for s1494, with 2 processors SA took 86 seconds to achieve the desired quality, while SimE took only 17 seconds. With 5 processors, SA required 63 seconds on average, while SimE needed only 6. Similar trends are seen for most circuits.

For appreciable quality solutions, SimE has exhibited dramatic speedups with increase in number of processors, even when compared to other, more established heuristics. The results obtained suggest that in scenarios where placement quality considerations are overridden by design time constraints, the proposed parallel SimE algorithm should be favored.

V. Future Work, Conclusion & Discussion

This paper presented the application of a modified Distributed SimE algorithm to a multi-objective VLSI cell placement problem. The algorithm focused on distributing the work load among processors. Random allocation of work load in each iteration resulted in better traversal of search for our complex multiobjective NP-hard design problem.

The results showed a significant reduction in runtime for all circuits, although the speedup was more obvious for larger ones. This speedup trend was compared to other established iterative and evolutionary heuristics from literature, and was shown to be more consistent with increasing number of processors.

This work can be extended along several lines. One would be to investigate suitable parameters for the SimE algorithm that will enable better quality and run-times. At the moment, the same parameters that have been set for serial execution are used. Another approach is to relieve computational resources during execution when the quality ceases to improve. This can be achieved by modifying the stopping criteria. If the quality does not improve for the last \( j \) iterations on \( k \) processors, then the number of processors can be reduced to \( k - 1 \), and this can continue until all processors are relieved. The effects of this experiment will be, that while execution continues to improve the obtained best solution, the distribution of increased number of rows on reduced number of processors will enable exploring different regions of the search space in the same run, and will hopefully result in better quality with reduced resources. Our initial experiments on this idea have been encouraging.

ACKNOWLEDGMENT

The authors thank King Fahd University of Petroleum & Minerals (KFUPM), Dhahran, Saudi Arabia, for support under Project Code COE/CHELPLACE/263.

REFERENCES


<table>
<thead>
<tr>
<th>Circuit Name</th>
<th># of Cells</th>
<th>Random Row Distribution</th>
<th>Qual Distribution</th>
</tr>
</thead>
<tbody>
<tr>
<td>Name</td>
<td>( N_p=1 )</td>
<td>( N_p=2 )</td>
<td>( N_p=3 )</td>
</tr>
<tr>
<td>s641</td>
<td>433</td>
<td>UH 4.99</td>
<td>4.97</td>
</tr>
<tr>
<td>s1238</td>
<td>534</td>
<td>16.5 9.24</td>
<td>9.29</td>
</tr>
<tr>
<td>s1494</td>
<td>661</td>
<td>67 17.4 6.15</td>
<td>4.88</td>
</tr>
<tr>
<td>s1488</td>
<td>667</td>
<td>60.23</td>
<td>24.6</td>
</tr>
<tr>
<td>s3330</td>
<td>1961</td>
<td>UH 6.78.02</td>
<td>115</td>
</tr>
<tr>
<td>s5378</td>
<td>2993</td>
<td>UH 1.620</td>
<td>338.2</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>