## ELEC516 Digital VLSI System Design and Design Automation (spring, 2010)

## Assignment 4 - Reference solution

## 1) Pulse-plate 1T DRAM cell

a) Timing diagrams for nodes BL and Y when writing "0" and "1"

Timing diagram of pulse-plate 1T DRAM cell

b) From the figures in (a), storing '0' leaves node Y at 0 V, while writing '1' raises it to $2 V_{dd}-V_{th}$. A conventional 1T DRAM cell stores $V_{dd}-V_{th}$ for '1', so the difference between the '1' and '0' levels grows from $V_{dd}-V_{th}$ to $2 V_{dd}-V_{th}$, an increase of $V_{dd}$. The benefits of the larger voltage difference include better noise tolerance, an easier sense-amplifier design, and less sensitivity to leakage.

c) BL is pre-discharged to 0 for the read operation. When reading '0', node Y is initially at 0 V, so $V_{BL}=0$ after WL is asserted. When reading '1', the charge-sharing equation gives:

$$
\begin{aligned}
& \left(2 V_{dd}-V_{th}-V_{dd}\right) \times C_{\text{cell}}+0 \times C_{BL}=C_{BL} \times V_{BL}+\left(V_{BL}-V_{dd}\right) \times C_{\text{cell}} \\
& \Rightarrow V_{BL}=\frac{\left(2 V_{dd}-V_{th}\right) \times C_{\text{cell}}}{C_{\text{cell}}+C_{BL}}
\end{aligned}
$$

d) Since BL starts at 0, $\Delta V_{BL}=V_{BL}$, and the equation in c) gives:

$$
\Delta V_{BL} \geq 250\,\mathrm{mV} \Rightarrow C_{\text{cell}}>13.89\,\mathrm{fF}
$$
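The rearrangement behind part d) can be sketched in a few lines. This is a minimal illustration, not the assignment's own calculation: the numeric values of `C_BL`, `VDD`, `VTH` below are hypothetical placeholders, since the problem's parameters are not restated in this solution. The function names are likewise only illustrative.

```python
def bitline_voltage(c_cell, c_bl, vdd, vth):
    """V_BL after reading a '1', using the closed form from part c)."""
    return (2 * vdd - vth) * c_cell / (c_cell + c_bl)


def charge_residual(v_bl, c_cell, c_bl, vdd, vth):
    """Left side minus right side of the charge-sharing equation in part c)."""
    before = (2 * vdd - vth - vdd) * c_cell + 0.0 * c_bl
    after = c_bl * v_bl + (v_bl - vdd) * c_cell
    return before - after


def min_cell_capacitance(dv_min, c_bl, vdd, vth):
    """Smallest C_cell with V_BL >= dv_min (closed form solved for C_cell)."""
    return dv_min * c_bl / (2 * vdd - vth - dv_min)


# Hypothetical values for illustration only; NOT the assignment's numbers.
C_BL, VDD, VTH, DV_MIN = 1e-12, 3.3, 0.5, 0.25

c_min = min_cell_capacitance(DV_MIN, C_BL, VDD, VTH)
v_bl = bitline_voltage(c_min, C_BL, VDD, VTH)
print(f"C_cell >= {c_min * 1e15:.2f} fF gives V_BL = {v_bl * 1e3:.1f} mV")
```

Substituting the assignment's own $C_{BL}$, $V_{dd}$ and $V_{th}$ into `min_cell_capacitance` should give the bound quoted in part d).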

## 2) DRAM with divided bit-line structure

a) The bit-line capacitance $C_{bit}$ (written $C_{BL}$ in part c) is contributed by the dummy cell and the DRAM cells, as well as the input capacitance of the sense amplifier, with the pre-charge transistor also taken into account:

$$
C_{b i t}=256 \times 8 f F+50 f F+2 \times 8 f F=2.114 p F
$$
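As a quick check of this sum, the sketch below adds up the same terms with and without the pre-charge transistor. The constant names are illustrative, and the mapping of each term to a physical element (256 cells at 8 fF each, a 50 fF sense-amplifier input, and 8 fF each for the dummy cell and the pre-charge transistor) is a reading of the equation rather than something stated term by term in the solution.

```python
FF = 1e-15  # one femtofarad, in farads

N_CELLS = 256          # DRAM cells attached to the bit line
C_DEVICE = 8 * FF      # per-device contribution used in the equation
C_SENSE_AMP = 50 * FF  # sense-amplifier input capacitance


def bitline_capacitance(include_precharge=True):
    """Sum the bit-line loads; the dummy cell is always counted once."""
    extra_devices = 2 if include_precharge else 1  # dummy cell (+ pre-charge)
    return N_CELLS * C_DEVICE + C_SENSE_AMP + extra_devices * C_DEVICE


print(f"with pre-charge transistor:    {bitline_capacitance(True) * 1e12:.3f} pF")
print(f"without pre-charge transistor: {bitline_capacitance(False) * 1e12:.3f} pF")
```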

(Note: if the pre-charge transistor is ignored, $C_{bit}=256 \times 8\,\mathrm{fF}+50\,\mathrm{fF}+8\,\mathrm{fF}=2.106\,\mathrm{pF}$.)

b) The timing diagram is as follows:

Timing Diagram of DRAM cell


Normally, $C_{d}$ is chosen close to $C_{bit}$, since we want $V_{pre}=\frac{\left(V_{dd}-V_{th}\right) \times C_{bit}}{C_{d}+C_{bit}}$ to be close to $\frac{V_{dd}}{2}$.

c) From the charge-balance equation for each side of the sense amplifier:

$$
\begin{gathered}
C_{B L} \times V_{\text {pre }}+C_{c} \times V_{\text {store }}=\left(C_{B L}+C_{c}\right) \times\left(V_{\text {pre }}+\Delta V\right) \\
\Delta V=\frac{C_{c} \times\left(V_{\text {store }}-V_{\text {pre }}\right)}{C_{B L}+C_{c}}
\end{gathered}
$$

To read "0", $V_{\text{store-0}}=0$; to read "1", $V_{\text{store-1}}=V_{dd}-V_{th}$.

Ideally the two read signals have equal magnitude, $\left|\Delta V_{\text{store-0}}\right|=\left|\Delta V_{\text{store-1}}\right|$, which requires $V_{\text{pre}}=\frac{1}{2}\left(V_{\text{store-1}}+V_{\text{store-0}}\right)=\frac{1}{2}\left(V_{dd}-V_{th}\right)$.

The voltage difference seen by the sense amplifier can thus be written as:

$$
\begin{aligned}
& V_{B L}-V_{p r e}=\frac{1}{2} \frac{C_{c}}{C_{B L}+C_{c}}\left(V_{d d}-V_{t h}\right) \approx \frac{1}{2} \frac{C_{c} \times V_{d d}}{C_{B L}+C_{c}} \geq 60 \mathrm{mV} \\
& \Rightarrow C_{c} \geq 51.98 \mathrm{fF}
\end{aligned}
$$
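The bound on $C_{c}$ follows from solving the approximate inequality for $C_{c}$. The sketch below does this numerically. It assumes $V_{dd}=5\,\mathrm{V}$ (the supply used in problem 4, and the value that reproduces the 51.98 fF figure) and takes $C_{BL}=2.114\,\mathrm{pF}$ from part a); the function name is illustrative.

```python
def min_storage_capacitance(dv_min, c_bl, vdd):
    """Solve 0.5 * Cc * Vdd / (C_BL + Cc) >= dv_min for Cc."""
    return 2 * dv_min * c_bl / (vdd - 2 * dv_min)


C_BL = 2.114e-12  # bit-line capacitance from part a), in farads
VDD = 5.0         # assumed supply voltage, in volts
DV_MIN = 60e-3    # required sense-amplifier input, in volts

c_c_min = min_storage_capacitance(DV_MIN, C_BL, VDD)
print(f"C_c >= {c_c_min * 1e15:.2f} fF")
```

With these inputs the result is about 51.98 fF, matching the bound above.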

## 3) Column Decoder design

Consider a hybrid implementation in which $x$ address bits are pre-decoded. The total transistor count is:

$$
N_{\text{dec}}=N_{\text{pre}}+N_{\text{pass}}+N_{\text{tree}}=(x+1) \times 2^{x}+2^{4}+2\left(2^{4-x}-1\right)
$$

Since the number of serial pass transistors is restricted to 3, $x$ can be 2, 3 or 4:

$$
\begin{aligned}
& x=2 \Rightarrow N_{\text {dec }}=34 \\
& x=3 \Rightarrow N_{\text {dec }}=50 \\
& x=4 \Rightarrow N_{\text {dec }}=96
\end{aligned}
$$
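The count formula can be evaluated directly for each admissible $x$. The sketch below is a literal transcription of the expression for $N_{\text{dec}}$ with a 4-bit column address; the function name and the `address_bits` parameter are illustrative.

```python
def decoder_transistors(x, address_bits=4):
    """Transistor count of the hybrid decoder when x bits are pre-decoded."""
    n_pre = (x + 1) * 2 ** x                    # pre-decoder
    n_pass = 2 ** address_bits                  # first-level pass transistors
    n_tree = 2 * (2 ** (address_bits - x) - 1)  # tree decoder on remaining bits
    return n_pre + n_pass + n_tree


counts = {x: decoder_transistors(x) for x in (2, 3, 4)}
best_x = min(counts, key=counts.get)
for x, n in counts.items():
    print(f"x = {x}: N_dec = {n}")
print(f"fewest transistors at x = {best_x}")
```

Evaluated literally, the expression gives 34, 50 and 96 transistors for $x=2$, 3 and 4, so $x=2$ is the smallest option.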

Thus, the most efficient implementation pre-decodes 2 bits.
The reference schematic is as follows:



## 4) MSE Processing Element design

a) Two's complement subtraction on 8-bit data:

b) One possible design is as follows:


Delay:

$$
D=d_{\text{inv}}+d_{\text{8bit\_adder}}+d_{\text{inv}}+d_{\text{xor}}+d_{\text{12bit\_adder}}=7.6\,\mathrm{ns}
$$

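According to the timing requirement:
$T \geq T_{\text{clk-Q}}+D+T_{\text{setup}}=0.3+7.6+0.1=8\,\mathrm{ns}$, which corresponds to a maximum clock frequency of 125 MHz.

c) $P_{M}=SC \times V^{2} \times f$, where $SC$ is the total switched capacitance, $V$ the supply voltage and $f$ the clock frequency.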
d) In the pipelined case, the critical path is shortened, so the maximum clock frequency can be increased. Alternatively, if the clock frequency is kept the same (a fixed performance requirement), the supply voltage can be reduced while still meeting the delay requirement. From c), reducing the supply voltage reduces power.

In the parallel case, $SC$ increases because some hardware is duplicated, but the throughput also increases. For a fixed performance requirement, both the clock frequency and the supply voltage can therefore be reduced, and from c) $P_{M}$ decreases.

e) For the design in b), we have:

$$
\begin{aligned}
& P_{M}=\left(C_{\text{8bit\_adder}}+9 \times C_{\text{inv}}+8 \times C_{\text{xor}}+C_{\text{12bit\_adder}}+39 \times C_{\text{reg}}\right) \times V^{2} \times f \\
& \Rightarrow P_{M}=(1200+180+240+2000+1560) \times 10^{-15}\,\mathrm{F} \times 25\,\mathrm{V}^{2} \times 125 \times 10^{6}\,\mathrm{Hz}=16.2\,\mathrm{mW}
\end{aligned}
$$
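The power figure can be reproduced with a short script. This is a sketch that uses the capacitance totals from the equation above, a 5 V supply and the 8 ns clock period from part b). The per-unit values (20 fF per inverter, 30 fF per XOR, 40 fF per register bit) are inferred by dividing those totals by the instance counts, and the dictionary keys are illustrative labels.

```python
FF = 1e-15  # one femtofarad, in farads

# Switched capacitance per block in fF, as totalled in the power equation.
# Per-unit values (20, 30, 40 fF) are inferred from those totals.
switched_cap_ff = {
    "8-bit adder": 1200,
    "inverters": 9 * 20,
    "XOR gates": 8 * 30,
    "12-bit adder": 2000,
    "registers": 39 * 40,
}

VDD = 5.0                    # supply voltage in volts
T_CLK_NS = 0.3 + 7.6 + 0.1   # clk-to-Q + logic delay + setup, in ns
F_CLK = 1e9 / T_CLK_NS       # clock frequency in Hz

sc_total = sum(switched_cap_ff.values()) * FF
power = sc_total * VDD ** 2 * F_CLK
print(f"SC = {sc_total / FF:.0f} fF, f = {F_CLK / 1e6:.0f} MHz, "
      f"P = {power * 1e3:.1f} mW")
```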

f) The new design is as follows:


Delay in stage one: $D_{1}=t_{\text{inv}}+t_{\text{8bit\_adder}}+t_{\text{inv}}+t_{\text{xor}}=0.4+2.8+0.4=3.6\,\mathrm{ns}$

Delay in stage two: $D_{2}=t_{\text{12bit\_adder}}=4\,\mathrm{ns}$

Minimum clock period: $T_{clk} \geq \max \left(D_{1}, D_{2}\right)+t_{\text{setup}}+t_{\text{ck-q}}=4+0.1+0.3=4.4\,\mathrm{ns}$

If we still want to run at 125 MHz, the supply voltage can be lowered until the clock period stretches back to 8 ns. Taking delay as inversely proportional to supply voltage, the new voltage is:

$$
\frac{V_{\text {new }}}{5 \mathrm{~V}}=\frac{4.4 n s}{8 n s} \Rightarrow V_{\text {new }}=2.75 \mathrm{~V}
$$

The new power consumption thus can be formulated:

$$
\begin{aligned}
& P_{M,\text{new}}=SC_{\text{new}} \times V_{\text{new}}^{2} \times f=\left(SC_{\text{old}}+20 \times C_{\text{reg}}\right) \times V_{\text{new}}^{2} \times f \\
& \Rightarrow P_{M,\text{new}}=(5180+800) \times 10^{-15}\,\mathrm{F} \times (2.75\,\mathrm{V})^{2} \times 125 \times 10^{6}\,\mathrm{Hz}=5.65\,\mathrm{mW}
\end{aligned}
$$
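The pipelined numbers can be chained together in one sketch: clock period, scaled supply voltage, then power. It follows the solution's own simplification that delay scales inversely with supply voltage, and it takes 40 fF per register bit, a value inferred from the 800 fF added by the 20 pipeline registers. Variable names are illustrative.

```python
FF = 1e-15  # one femtofarad, in farads

T_CLK_Q_NS, T_SETUP_NS = 0.3, 0.1
D1_NS, D2_NS = 3.6, 4.0        # stage delays from part f)
T_OLD_NS = 8.0                 # clock period of the unpipelined design
V_OLD = 5.0                    # original supply voltage in volts
F_CLK = 125e6                  # clock frequency kept at 125 MHz

# Minimum clock period of the pipelined design.
t_new_ns = max(D1_NS, D2_NS) + T_SETUP_NS + T_CLK_Q_NS

# Simplified model used in the solution: delay inversely proportional to V.
v_new = V_OLD * t_new_ns / T_OLD_NS

# 20 extra pipeline register bits at an inferred 40 fF each.
sc_new = (5180 + 20 * 40) * FF
p_new = sc_new * v_new ** 2 * F_CLK

print(f"T_clk >= {t_new_ns:.1f} ns, V_new = {v_new:.2f} V, "
      f"P_new = {p_new * 1e3:.2f} mW")
```

A real process does not follow the linear delay model exactly, so the 2.75 V result is best read as a first-order estimate.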

## 5) Design for testability

a) Test vector: (X, X, 0, 0, 0)

b) The stuck-at-1 fault at node $y$ is also detected by the same vector.

## 6) Low Power design

a) For transition-based encoding (assuming the initial state is 0000000000):

| Data vector | Transition-based code | Bus transitions |
| :--- | :--- | :--- |
| 0000100100 | 0000100100 | 2 |
| 1110101011 | 1110001111 | 7 |
| 0110010100 | 1000111111 | 4 |
| 0110100100 | 0000110000 | 5 |
| 0111010100 | 0001110000 | 1 |
| 1000101001 | 1111111101 | 6 |
| 0101000100 | 1101101101 | 2 |
| 1010011000 | 1111011100 | 4 |
| 1010010000 | 0000001000 | 6 |

Total number of transitions: 37

$E=\frac{1}{2} C V^{2} \times SW=37 \times \frac{1}{2} \times 1\,\mathrm{pF} \times (1\,\mathrm{V})^{2}=18.5\,\mathrm{pJ}$
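The transition-based codes and their bus activity can be regenerated programmatically. The sketch below assumes the code word is the bitwise XOR of each data vector with the previous data vector (starting from all zeros), which is the reading consistent with the tabulated codes. The function names are illustrative.

```python
DATA = [
    "0000100100", "1110101011", "0110010100", "0110100100", "0111010100",
    "1000101001", "0101000100", "1010011000", "1010010000",
]
WIDTH = 10


def transition_encode(words, width=WIDTH):
    """XOR each data word with the previous data word (initially all zeros)."""
    codes, previous = [], 0
    for word in words:
        value = int(word, 2)
        codes.append(format(value ^ previous, f"0{width}b"))
        previous = value
    return codes


def bus_transitions(codes):
    """Count bit flips between consecutive bus states, starting from zeros."""
    flips, previous = [], 0
    for code in codes:
        value = int(code, 2)
        flips.append(bin(value ^ previous).count("1"))
        previous = value
    return flips


codes = transition_encode(DATA)
flips = bus_transitions(codes)
energy_pj = 0.5 * 1.0 * 1.0 ** 2 * sum(flips)  # C = 1 pF, V = 1 V

for word, code, n in zip(DATA, codes, flips):
    print(word, code, n)
print(f"total transitions = {sum(flips)}, E = {energy_pj} pJ")
```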
b) For active-high coding (the data vectors are driven onto the bus unchanged):

Total number of transitions: 45

$E=\frac{1}{2} C V^{2} \times SW=45 \times \frac{1}{2} \times 1\,\mathrm{pF} \times (1\,\mathrm{V})^{2}=22.5\,\mathrm{pJ}$
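The count of 45 is the sum of Hamming distances between consecutive data vectors when they are placed on the bus as they are. A minimal sketch, assuming the bus starts at all zeros as in part a); the helper name `hamming` is illustrative.

```python
DATA = [
    "0000100100", "1110101011", "0110010100", "0110100100", "0111010100",
    "1000101001", "0101000100", "1010011000", "1010010000",
]


def hamming(a, b):
    """Number of bit positions in which two equal-length bit strings differ."""
    return bin(int(a, 2) ^ int(b, 2)).count("1")


previous = "0" * 10  # assumed initial bus state
flips = []
for word in DATA:
    flips.append(hamming(previous, word))
    previous = word

energy_pj = 0.5 * 1.0 * 1.0 ** 2 * sum(flips)  # C = 1 pF, V = 1 V
print(f"per-word transitions = {flips}")
print(f"total transitions = {sum(flips)}, E = {energy_pj} pJ")
```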
c) For redundant (bus-invert) coding:

| Data vector | Redundant code | Inv bit | Transitions |
| :--- | :--- | :--- | :--- |
| 0000100100 | 0000100100 | 0 | 2 |
| 1110101011 | 0001010100 | 1 | 4 |
| 0110010100 | 0110010100 | 0 | 4 |
| 0110100100 | 0110100100 | 0 | 2 |
| 0111010100 | 0111010100 | 0 | 3 |
| 1000101001 | 0111010110 | 1 | 2 |
| 0101000100 | 0101000100 | 0 | 4 |
| 1010011000 | 0101100111 | 1 | 4 |
| 1010010000 | 0101101111 | 1 | 1 |

Total number of transitions (including the invert line): 26

$E=\frac{1}{2} C V^{2} \times SW=26 \times \frac{1}{2} \times 1\,\mathrm{pF} \times (1\,\mathrm{V})^{2}=13\,\mathrm{pJ}$, an energy saving of $42.2\%$ relative to the active-high coding in b).
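The redundant code can be generated with the usual bus-invert rule. The sketch below assumes a word is inverted whenever it differs from the current state of the ten data lines in more than half of the bit positions, and that the invert line starts at 0 and is counted as an eleventh wire. These assumptions reproduce the tabulated rows; the function name is illustrative.

```python
DATA = [
    "0000100100", "1110101011", "0110010100", "0110100100", "0111010100",
    "1000101001", "0101000100", "1010011000", "1010010000",
]


def bus_invert(words, width=10):
    """Return (code, inv_bit, transitions) per word under bus-invert coding."""
    mask = (1 << width) - 1
    rows, bus, inv = [], 0, 0  # data lines and invert line start at zero
    for word in words:
        value = int(word, 2)
        # Invert when sending the word as-is would flip more than half the lines.
        invert = int(bin(value ^ bus).count("1") > width / 2)
        sent = value ^ mask if invert else value
        flips = bin(sent ^ bus).count("1") + (invert ^ inv)
        rows.append((format(sent, f"0{width}b"), invert, flips))
        bus, inv = sent, invert
    return rows


rows = bus_invert(DATA)
total = sum(flips for _, _, flips in rows)
energy_pj = 0.5 * 1.0 * 1.0 ** 2 * total  # C = 1 pF, V = 1 V
saving = 1 - total / 45                   # relative to the 45 transitions in b)

for word, (code, inv_bit, flips) in zip(DATA, rows):
    print(word, code, inv_bit, flips)
print(f"total = {total}, E = {energy_pj} pJ, saving = {saving:.1%}")
```

Some formulations of bus-invert coding include the invert line in the Hamming-distance decision; the simpler rule used here is enough to match this data set.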

(block diagram of encoder and decoder)

