Analog Integrated Circuits and Signal Processing, 43, 183–190, 2005 © 2005 Springer Science + Business Media, Inc. Manufactured in The Netherlands.

# A New Static Differential CMOS Logic with Superior Low Power Performance

## MUHAMMAD E.S. ELRABAA

Computer Engineering Department, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia

Received May 12, 2004; Revised October 24, 2004; Accepted November 29, 2004

Abstract. A new differential static CMOS logic (DSCL) family is devised. The new circuit is fully static, making it simple to design. The circuit topology of the DSCL and its operation are explained. Delay optimization of the new circuits was performed and confirmed their fully static behavior. Their performance in terms of delay, power, and area is compared to that of conventional static differential logic (DCVSL) and dynamic differential logic (DDCVSL). SPICE simulations using a 0.18 $\mu$m technology with a power supply of 1.8 V were used to evaluate the performance of the three circuits. Two sets of simulations were carried out: one with equal input capacitances for all circuits and another with equal circuit delays. For each design, all circuits were optimized for minimum delay. At equal input capacitance, the DSCL achieved 40% less delay than the DCVSL at one third the power. At equal delay, the DSCL dissipated 20% of the power of the DCVSL and 78% of that of the DDCVSL, making it the most energy-efficient of the three circuits.

Key Words: low-power, differential circuits, digital integrated circuits, static circuits

#### 1. Introduction

One of the first realizations of static differential CMOS logic, known as Differential Cascode Voltage Switch Logic (DCVSL), was introduced in 1984 [1]. Since then, researchers have shown great interest in differential logic. This is due to its potential to efficiently realize complex logic functions such as XOR/XNOR and multiplexing, which form the basic building blocks of most data path units (e.g. adders, multipliers, registers, etc.). Also, due to their dual-rail nature, differential gates can be used to implement self-timed logic [2]: a completion signal is generated when the two rails are different (i.e. after the switching is complete).

Many changes to the basic DCVSL, shown in Fig. 1, were proposed to improve its performance. In [3] many of these techniques were evaluated. They ranged from static techniques with reduced internal voltage swings to dynamic techniques with different methods of pre-charging the outputs of the differential gate. In that work, the dynamic implementation of the DCVSL, Fig. 2, was shown to be the fastest and most energy-efficient technique. Recently, more dynamic techniques were proposed [4, 5] with improved power/delay performance over the conventional differential Domino. However, these techniques add huge design complexity and require complex clocking.

In all of these techniques, the static ones slightly improved the speed at the expense of increased power consumption, while the dynamic techniques significantly improved the speed but increased the power even further. This is due to the increased activity factor (switching probability) resulting from the fact that one of the two outputs of a dynamic differential gate will always switch during evaluation. This property of dynamic circuits has limited their use to highly critical paths where power is sacrificed for speed.

In [6] a differential static logic was obtained by tying the outputs of two conventional static CMOS gates to a back-to-back PMOS keeper. One of the gates implements the function while the other implements its complement. This yielded good performance but significantly increased the number of transistors used (and hence the area). As will be shown in Section 3, the newly proposed DSCL logic reduces the number of transistors significantly by using a truly differential style combined with a design optimization methodology.



*Fig. 1.* The circuit topology of the conventional DCVSL gate (a 2-input XOR/XNOR).



Fig. 2. The dynamic DCVSL (DDCVSL) gate structure.

In the next section the operation of the conventional DCVSL will be briefly explained to point out the cause of the inherent low performance of these gates. The newly proposed static differential logic, called differential static CMOS logic (DSCL), will be introduced in Section 3. Extensive SPICE simulations of numerous ring oscillators, used to optimize the design of the DSCL gates, will also be presented in that section. Performance comparisons, in terms of speed and power, with conventional DCVSL and dynamic DCVSL are presented in Section 4. Finally, conclusions are presented in Section 5. All simulations were carried out using a 0.18 $\mu$m technology with a power supply of 1.8 V. For this technology, HSPICE<sup>®</sup> simulations were carried out using BSIM3v3 MOS models for accurate results.



Fig. 3. The input/output waveforms of the DCVSL gate.

#### 2. Conventional DCVSL

As shown in Fig. 1, the DCVSL gate does not have a full pull-up logic tree. Instead, a latch made of two back-to-back PMOSFETs is used to transform the logic 0 of one output to logic 1 on the other. This means that the output switching from low-to-high (L-H) will always lag the output switching from high-to-low (H-L) as shown in Fig. 3. Also the output having an H-L transition will suffer from contention between the NMOS logic tree and the pull-up PMOS which is initially ON. This contention, as evident from Fig. 3, causes the H-L output transition to have a slow last portion. These two facts cause the degraded performance of DCVSL gates in the form of:

- 1. A slower speed since the L-H edge is always lagging the H-L edge. In fact, since the H-L output transition is initiated by L-H transitions at the inputs, this effect is compounded over multiple logic levels. Hence the total delay of a DCVSL logic path will equal the sum of the L-H propagation delays of the path's gates, not the average of the H-L and L-H gate delays as with other logic families.
- 2. An increased power dissipation due to the contention between the PMOS latch and the NMOS logic tree.

The dynamic version of the DCVSL eliminates the contention problem by pre-charging the outputs to VDD as shown in Fig. 2. However this gate suffers from the following:

1. Complexity of design: The timing of the clock signal is crucial. If the clock arrives at a gate too early, while all the inputs to the gate are still pre-charged high, both outputs will start discharging. This will cause an increased delay as well as increased power (due to contention at both outputs of the gate). If the clock is made too slow, then it gets into the delay path; i.e. the path delay would be determined by the clock rather than by the logic function being evaluated. All this means that the clock distribution circuits have to be carefully designed and checked against all process/supply/temperature corners. Also, differing path delays cause the inputs to a gate to arrive at different times, producing false output transitions. Again, careful design must be employed to eliminate or reduce these conditions.

2. Increased dynamic power due to the higher activity factor (switching probability) of DDCVSL gates and false transitions. In fact, due to its dual rail, the switching activity of DDCVSL gates is 100% (i.e. one of the outputs will always switch in every clock cycle).
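The difference in activity factor can be illustrated with a small enumeration. The following is only a sketch under an assumption that is not in the text: the two inputs of a 2-input XOR are independent and uniformly random from cycle to cycle. The function names are hypothetical. It counts how often a static XOR output changes between consecutive cycles and how many rails of a pre-charged dual-rail gate discharge per evaluation.

```python
from itertools import product


def static_xor_activity() -> float:
    """Fraction of consecutive-cycle input pairs for which a static
    2-input XOR output changes (assumes uniformly random inputs)."""
    patterns = list(product([0, 1], repeat=2))
    toggles = 0
    total = 0
    for prev in patterns:
        for curr in patterns:
            total += 1
            if (prev[0] ^ prev[1]) != (curr[0] ^ curr[1]):
                toggles += 1
    return toggles / total


def dynamic_dual_rail_activity() -> float:
    """Average number of rails that discharge per evaluation in a
    pre-charged dual-rail XOR/XNOR gate."""
    patterns = list(product([0, 1], repeat=2))
    discharges = 0
    for a, b in patterns:
        out = a ^ b
        out_bar = 1 - out
        # Both rails start pre-charged high; the rail evaluating to 0 discharges.
        discharges += (out == 0) + (out_bar == 0)
    return discharges / len(patterns)


static_alpha = static_xor_activity()
dynamic_alpha = dynamic_dual_rail_activity()
```

Under the stated assumption the static output toggles in half of the cycles, whereas exactly one rail of the dynamic gate discharges (and must later be pre-charged again) in every cycle, which is the 100% switching activity mentioned above.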

#### 3. The Proposed DSCL

#### 3.1. General Description

The new DSCL gate, illustrated in Fig. 4, is obtained by adding a PMOS logic tree pull-up section. A total of 14 transistors is needed compared to 18 transistors in [6], a 22% reduction. The savings will be even larger for larger fan-ins (e.g. for a 3-input XOR/XNOR gate, the savings would be $\sim 45\%$). With the addition of the PMOS pull-up tree, the output L-H transition starts at the same time as the H-L transition. The PMOS cross-coupled pull-up latch is still retained to assist with the pull-up. However, the contention with the NMOS pull-down tree is greatly reduced by the PMOS pull-up tree. This makes the L-H and H-L transition delays almost equal, as illustrated in Fig. 5. For this figure, the P/N ratio (PMOS/NMOS size ratio) was set equal to the devices' mobility ratio (i.e. $\approx 2.5$). The total device sizes per input (and hence the input capacitance) and the load capacitance (50 fF, a fan out of 1.5) were made equal to those of the DCVSL case. Also, the PMOS latch transistors were made smaller than in the DCVSL case since the pull-up action is mainly done by the PMOS logic tree, thus reducing the contention (and power) further. As the figure shows, the DSCL gate had an average propagation delay that is 40% less than the L-H delay of the DCVSL gate. As was explained in Section 2, the L-H delay of the DCVSL determines the whole path delay, which is why it is used for the comparison.

*Fig. 4.* The circuit topology of the new DSCL gate (a 2-input XOR/XNOR).

*Fig. 5.* The input/output waveforms of the new DSCL gate.

Figure 6 shows the supply currents for the DCVSL and DSCL gates during switching. As the figure shows, the peak supply current of the DSCL gate is almost one third of the DCVSL's. The power ratio between the DCVSL and DSCL was 3.48 (i.e. the DSCL consumed 72% less power). This difference is due to the contention in the DCVSL gate.

*Fig. 6.* The supply currents of the DCVSL and the new DSCL gates during output switching.
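The relation between the quoted power ratio and the quoted percentage saving can be checked with a one-line conversion. This is a minimal sketch; the helper name `percent_reduction` is hypothetical and only the ratio 3.48 comes from the text.

```python
def percent_reduction(ratio: float) -> float:
    """Percentage saving of the lower-power circuit, given the ratio
    (reference power) / (lower power)."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return (1.0 - 1.0 / ratio) * 100.0


# Power ratio DCVSL / DSCL reported for the single-gate simulation.
saving = percent_reduction(3.48)
```

With the ratio of 3.48 the conversion gives a saving of roughly 71%, in line with the figure of about 72% quoted above; the small difference is presumably rounding in the reported ratio.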

#### 3.2. P/N Ratio Optimization

In order to obtain the optimum P/N ratio for the DSCL circuit, several ring oscillators were simulated. These oscillators have 2-input DSCL XOR/XNOR gates as delay stages. Two values of fan out were used: 1 and 3. The average gate delay was measured as the period of oscillation divided by the number of stages. The P/N ratio was varied from 0.15 to 4 while keeping the gate's total input capacitance constant (i.e. P + N = constant). Also, for each value of fan out, three sizes of the cross-coupled PMOS (CCP) latch were used: 0 (i.e. no CCP), 1 and 3 $\mu$m.

Ring oscillators were used because they capture both the effects of Fan out as well as input slope on the delay. Also, average delays are readily available from the oscillation frequency.
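The sweep described above can be sketched as follows. This is an illustration only: the total width of 10 µm, the 5-stage oscillator and the 1 ns period are hypothetical values, not taken from the simulations, and the function names are assumptions. The first helper splits a fixed total width into PMOS and NMOS widths for a given P/N ratio, and the second applies the delay definition given in the text (period divided by the number of stages).

```python
from typing import Tuple


def split_widths(total_width_um: float, pn_ratio: float) -> Tuple[float, float]:
    """Return (W_p, W_n) with W_p / W_n = pn_ratio and W_p + W_n = total."""
    if total_width_um <= 0 or pn_ratio <= 0:
        raise ValueError("total width and P/N ratio must be positive")
    w_n = total_width_um / (1.0 + pn_ratio)
    w_p = total_width_um - w_n
    return w_p, w_n


def average_gate_delay(period_s: float, n_stages: int) -> float:
    """Average gate delay as defined in the text: oscillation period
    divided by the number of ring-oscillator stages."""
    if n_stages <= 0:
        raise ValueError("number of stages must be positive")
    return period_s / n_stages


# Hypothetical sweep points spanning the 0.15 to 4 range used in the text.
sweep = {r: split_widths(10.0, r) for r in (0.15, 1.0, 1.3, 2.5, 4.0)}
delay = average_gate_delay(1e-9, 5)  # hypothetical period and stage count
```

Keeping the sum of the two widths fixed is what holds the input capacitance approximately constant across the sweep, so delay differences can be attributed to the P/N ratio alone.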

Figures 7(a) and (b) show the normalized delay of the DSCL gates for the two values of fan out. The delays were normalized to the delay of the DSCL gate with no CCP, fan out of 1 and a P/N ratio of 1.

Also shown in these figures is the normalized energy per transition (E/T) of these gates. These were obtained by integrating the transient power during switching (L-H and H-L transitions) over the switching time. Again, for each fan out, the values of E/T were normalized to that with no CCP and P/N ratio of 1. The E/T was used to evaluate the effects of circuit optimization on power for the two values of fan out. This is possible since E/T is independent of the circuit's delay (and hence oscillating frequency) which varies with the P/N ratio. The following can be observed from these results:

- 1. The minimum delay is obtained at a P/N ratio of about 1.3 for all conditions of Fan out and CCP sizes. This is consistent with static CMOS gates. A ring oscillator of 2-input CMOS NANDs would have a minimum delay at that P/N ratio (~1.3) [7]. This shows the static behavior of the DSCL gates.
- 2. Having a small CCP keeper improves the delay slightly. This is however due to the small loads used in these simulations. At large Fan outs or wiring capacitances, the CCP size will have to increase and it would have a larger impact on the delay. This is why it was retained in the DSCL.
- 3. The E/T had a relatively flat response with the P/N ratio. This is because the total input capacitance was kept constant. This also shows that the rush-through currents (VDD to GND currents during switching) are fairly independent of the P/N ratio. This independence is reinforced by the input slopes: as the P/N ratio is decreased, the H-L slope increases and the L-H slope decreases, causing the larger NMOS transistors in the next stage to turn off faster (and hence reducing the rush-through current). The opposite occurs at higher P/N ratios. This is again the behavior of a fully static gate.
- 4. Retaining the CCP keeper actually slightly reduced the E/T for most values of P/N ratio. This is more evident for the case of Fan out of 3. This is a direct result of improving the delay and output slopes which reduces the rush-through currents.
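The energy per transition (E/T) discussed in the observations above is obtained by integrating the transient power over the switching window. A minimal sketch of that calculation on sampled supply-current data is given below. It assumes a constant 1.8 V supply and trapezoidal integration; the sample waveform is a hypothetical triangular pulse and the function names are not from the source.

```python
from typing import List, Sequence


def energy_per_transition(time_s: Sequence[float],
                          supply_current_a: Sequence[float],
                          vdd: float = 1.8) -> float:
    """Trapezoidal integral of vdd * i(t) over the sampled switching window."""
    if len(time_s) != len(supply_current_a) or len(time_s) < 2:
        raise ValueError("need two equally long sequences of at least 2 samples")
    energy = 0.0
    for k in range(1, len(time_s)):
        dt = time_s[k] - time_s[k - 1]
        p_prev = vdd * supply_current_a[k - 1]
        p_curr = vdd * supply_current_a[k]
        energy += 0.5 * (p_prev + p_curr) * dt
    return energy


def normalize(values: Sequence[float], reference: float) -> List[float]:
    """Normalize a list of E/T values to a chosen reference value."""
    return [v / reference for v in values]


# Hypothetical triangular current pulse: 1 mA peak, 200 ps wide.
t = [0.0, 1e-10, 2e-10]
i = [0.0, 1e-3, 0.0]
e_t = energy_per_transition(t, i)
```

Because the integral is taken per transition rather than per unit time, the result does not depend on the oscillation frequency, which is why E/T can be compared across P/N ratios that give different delays.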

Figure 8 shows the crossing point of the DSCL's differential outputs versus the P/N ratio. These results were obtained from the same simulations reported in Fig. 7. The fan out did not impact the crossing point at all, hence only one set of results is reported (fan out = 1). As the H-L transition becomes faster than the L-H transition, the crossing point moves up, and vice versa. The crossing points were normalized as:

$$\text{Normalized crossing point} = \frac{\text{Crossing point} - 0.5\,V_{DD}}{0.5\,V_{DD}}$$

So it shows the deviation (as a fraction) from the ideal crossing point of half the supply (i.e.  $V_{DD}/2$ ). This figure shows the clear correlation between the crossing point and the delay. At the optimum P/N ratio of 1.3, the normalized crossing point is about 0 (i.e. the outputs cross one another at  $V_{DD}/2$ ).
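The normalization above is simple to apply to simulated crossing voltages. The sketch below uses the 1.8 V supply stated in the text; the function name and the example crossing voltages are hypothetical.

```python
def normalized_crossing_point(v_cross: float, vdd: float = 1.8) -> float:
    """Deviation of the output crossing voltage from vdd / 2,
    expressed as a fraction of vdd / 2."""
    if vdd <= 0:
        raise ValueError("vdd must be positive")
    half = 0.5 * vdd
    return (v_cross - half) / half


ideal = normalized_crossing_point(0.9)    # outputs cross at exactly VDD/2
shifted = normalized_crossing_point(1.08)  # hypothetical crossing above VDD/2
```

A value of 0 corresponds to outputs crossing at exactly half the supply, positive values to a crossing point above it, and the extremes of +1 and -1 to crossings at the supply and ground rails.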



*Fig.* 7. The normalized delay and E/T of the DSCL gate as a function of the P/N ratio for 3 values of the CCP keeper size. Two sets of results are shown for two values of Fan out (1 and 3). Delays were normalized to the delay with no CCP, fan out of 1 and a P/N ratio of 1. E/T values were normalized separately for each fan out to the value with no CCP and a P/N ratio of 1 for that fan out.

#### 4. Performance Comparisons

The performance of the new DSCL was compared to that of the DCVSL and the DDCVSL using SPICE simulations of an 8-bit carry-ripple adder (CRA). This type of adder is very efficiently implemented by the differential circuits at hand. The worst-case delay (i.e. longest path delay) and the average power consumption were evaluated for the three circuit types. Two types of comparisons were made: (1) equal input capacitance comparisons and (2) equal delay comparisons. For all simulations the load capacitances at the adders' outputs were set to 100 fF.

#### 4.1. Equal Input Capacitance Comparisons

The following design procedure was used in designing the 3 adders for these comparisons:

1. For the conventional DCVSL and the new DSCL implementations, the sizes of the PMOS latch transistors were optimized for minimum delay for each circuit.
2. For the DSCL circuit, these sizes were also optimized in conjunction with the optimum P/N device size ratios (obtained in Section 3 above).
3. For the DDCVSL implementation, the clocked PMOS devices were sized such that the total pre-charging time is equal to the total delay. This is a standard design practice for such dynamic circuits to ensure they pre-charge within the low clock phase. This is especially important for this type of circuit since the pre-charging process actually 'propagates' from one logic level to the next. As for the PMOS latch transistors, the same size that was obtained for the conventional DCVSL implementation was used. This is to ensure that if a false transition occurs at a gate's output, it will be corrected with a worst-case delay equal to that of an equivalent DCVSL gate.
4. Also, to reduce glitches and false transitions, the clock was delayed between the different logic stages as shown in Fig. 9. The clock timing was optimized to get the maximum speed out of the DDCVSL adder. This also minimized false transitions and the associated increase in power consumption. The first clock inverter was added to account for the power consumed by the clocked devices in the first two logic stages.

*Fig. 8.* The normalized crossing point of the DSCL gate as a function of the P/N ratio for a fan out of 1. Results for the three values of CCP keeper size are shown. The crossing points were normalized as: Normalized crossing point = (Crossing point $- 0.5\,V_{DD}$)/($0.5\,V_{DD}$).

*Fig. 9.* The clocking methodology for the DDCVSL adder. Total number of logic levels is 8.

Table 1 below summarizes the normalized results of the 3 circuit types. The following can be observed:

- 1. The conventional DCVSL adder had a significantly worse delay due to the compounded effect of the slow L-H output transitions.
- 2. As expected, the DDCVSL achieved the highest speed (almost $2\times$ that of the DCVSL) since there is no dependency on the PMOS transistors for pull-up during evaluation. However, this came at the cost of a $1.5\times$ increase in power, which is due to the higher activity factor.
- 3. The new DSCL achieved a $1.6\times$ speed improvement over the DCVSL at only one third of the power, a very significant result. This is due to the elimination of contention combined with its lower activity factor.
- 4. The DSCL achieved the lowest power-delay product of the three circuits, making it the most energy efficient of the three.

*Table 1.* Normalized simulation results with equal input capacitances for the 3 adders.

|                     | DCVSL | DDCVSL | DSCL  |
|---------------------|-------|--------|-------|
| Delay               | 1X    | 0.51X  | 0.62X |
| Power               | 1X    | 1.53X  | 0.33X |
| Power-delay product | 1X    | 0.78X  | 0.21X |

*Table 2.* Normalized simulation results with equal delays for DCVSL and DSCL adders.

|               | DCVSL | DSCL  |
|---------------|-------|-------|
| Power         | 1X    | 0.22X |
| Area (Active) | 1X    | 0.4X  |
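The power-delay products and speed-ups quoted for the equal-input-capacitance comparison follow directly from the normalized delay and power entries of Table 1. The sketch below recomputes them as a sanity check; the dictionary layout and helper name are hypothetical, and only the numbers come from the table.

```python
# Normalized (delay, power) pairs from Table 1, relative to the DCVSL adder.
TABLE_1 = {
    "DCVSL": (1.00, 1.00),
    "DDCVSL": (0.51, 1.53),
    "DSCL": (0.62, 0.33),
}


def power_delay_product(delay: float, power: float) -> float:
    """Normalized power-delay product of one adder."""
    return delay * power


pdp = {name: power_delay_product(d, p) for name, (d, p) in TABLE_1.items()}
speedup = {name: 1.0 / d for name, (d, _) in TABLE_1.items()}
most_efficient = min(pdp, key=pdp.get)
```

Multiplying the rounded table entries gives about 0.78 for the DDCVSL and about 0.20 for the DSCL, against the 0.21 listed in the table; the difference presumably comes from rounding of the tabulated delay and power. The reciprocal delays reproduce the speed-ups of almost 2 and about 1.6 mentioned in the list above.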

#### 4.2. Equal Delay Comparisons

The DSCL adder was redesigned to have a delay equal to that of the DCVSL implementation. Table 2 summarizes the results of the comparison between the two adders. The area of each adder was estimated as the total active area (i.e. the total width of all transistors). While this is not an accurate estimate of the absolute area, it serves as a good estimate of the relative area ratio of the two adders. Again, the DSCL adder achieved a power that is about 20% of that of the DCVSL adder at 40% of the area. It can also be noted that the power-delay ratio between the two adders remains at $0.22\times$, which affirms the correctness of the design procedure.

The original DSCL adder was also compared with a re-designed DDCVSL adder with equal delay. The new DDCVSL adder was completely re-designed, along with its clock distribution circuitry, to eliminate glitches and false transitions at the required delay. The results of the simulations are summarized in Table 3. The DSCL consumed 22% less power; however, its area was 60% larger. This is due to the fact that the delay of these adders is dominated by the internal fan out rather than the external load. The DDCVSL adder had to be made very small to have a delay equal to that of the DSCL adder.

*Table 3.* Normalized simulation results with equal delays for DDCVSL and DSCL adders.

|               | DDCVSL | DSCL  |
|---------------|--------|-------|
| Power         | 1X     | 0.78X |
| Area (Active) | 1X     | 1.6X  |

#### 5. Conclusions

A new fully static differential CMOS logic circuit (DSCL) was devised. This circuit eliminates the contention between the PMOS back-to-back latch transistors and the pull-down logic tree that exists in conventional DCVSL circuits. P/N ratio optimization performed on the DSCL gates showed the optimum value to be around 1.3. Also, the delay and energy-per-transition simulation results show the completely static behavior of the new circuits. The performance of the DSCL circuit was compared to that of the DCVSL and the dynamic DCVSL using SPICE simulations of 8-bit CRAs. For the same input capacitance, the new DSCL achieved 40% less delay than the DCVSL at one third the power, a very significant accomplishment. Though the dynamic DCVSL achieved the lowest delay, this was at the cost of a $1.5\times$ increase in power over the DCVSL and $5\times$ over the DSCL. This very high power of the DDCVSL, added to the difficulty of design, makes the DSCL a very attractive option. At equal delay, the DSCL dissipated 20% of the power of the DCVSL and 78% of that of the DDCVSL. Since the DDCVSL was shown in [3] to be the most energy-efficient among all other differential CMOS logic families, this makes the DSCL the most energy-efficient among all differential circuits.

### Acknowledgment

The author is grateful for the facilities support provided by King Fahd University of Petroleum and Minerals.

#### References

- [1] L.G. Heller, et al., "Cascode voltage switch logic: A differential CMOS logic family," in *Proc. IEEE Int. Solid-State Circuits Conf.*, 1984, pp. 16–17.
- [2] Y.K. Tan and Y.C. Lim, "Self-timed precharge latch," in *Proc. Int. Symp. Circuits and Systems*, 1990, pp. 566–569.
- [3] P. Ng, P.T. Balsara, and D. Steiss, "Performance of CMOS differential circuits," *IEEE J. of Solid-State Circuits*, vol. 31, no. 6, pp. 841–846, 1996.
- [4] M.W. Allam and M.I. Elmasry, "Dynamic Current Mode Logic (DyCML): A new low-power high-performance logic style," *IEEE J. of Solid-State Circuits*, vol. 36, no. 3, pp. 550–558, 2001.
- [5] A.M. Fahim and M.I. Elmasry, "Low-power high-performance arithmetic circuits and architectures," *IEEE J. of Solid-State Circuits*, vol. 37, no. 1, pp. 90–94, 2002.
- [6] C. Tretz, et al., "Performance comparison of differential static CMOS circuit topologies in SOI technology," in *Proc. IEEE International SOI Conference*, 1998, pp. 123–124.
- [7] M.E.S. Elrabaa and M.I. Elmasry, "Split-gate logic circuits for multi-threshold technologies," in *Proc. Int. Symp. Circuits and Systems*, 2001, pp. 798–801.



**Muhammad E.S. Elrabaa** received his B.Sc. degree in Computer Engineering from Kuwait University, Kuwait, in 1989, and his M.A.Sc. and PhD degrees in Electrical Engineering from the University of Waterloo, Waterloo, Canada, in 1991 and 1995, respectively. His graduate research dealt with digital BiCMOS ICs and low-power circuit techniques. From 1995 till 1998, he worked as a senior circuit designer with Intel Corp., in Portland, Oregon, USA, where he designed and developed low-power digital circuits for microprocessors. From 1998 till 2001 he was with the EE department, UAE University, as an assistant professor. In 2001, he joined the Computer Engineering department, KFUPM University. His current research interests include reconfigurable computing, low-power circuits, and communication circuits. He has authored and co-authored several papers and a book, and holds two US patents.