

## DSP Design Techniques for Best Performance, Power and Cost

Niall Battson DSP Divisional Marketing



DSP Design Techniques 2

# Agenda

- Virtex-4 Family
- The XtremeDSP Slice
- Filter Techniques
- Case Studies
  - Digital Up Converter
  - 2-D FIR



# **The Virtex-4 Family**




## **Virtex-4 Family**

Virtex-4 is the first Xilinx family to introduce three separate platforms optimized for different application domains. This fundamental shift provided the greatest silicon efficiency and optimal cost.

#### Virtex-4 LX Platform

Optimized for high-performance Logic

#### Virtex-4 FX Platform

Optimized for Embedded Processing and high-speed Serial Connectivity

#### Virtex-4 SX Platform

Optimized for high-performance Signal Processing



# Virtex-4 SX



The SX family emphasizes Xilinx's commitment to **DSP** applications by providing a strong skew toward dedicated arithmetic units versus logic.




#### The XtremeDSP<sup>™</sup> Slice (also known as "DSP48")






#### 500 MHz maximum frequency in the fastest speed grade




**XILINX**


www.xilinx.com/dsp


#### Dynamically Reconfigurable DSP OPMODEs

| OPMODE                              | Z6 | Z5 | Z4 | Y3 | Y2 | X1 | X0 | Output                          |  |
|-------------------------------------|---|---|---|---|---|---|---|---------------------------------|--|
| Zero                                | 0 | 0 | 0 | 0 | 0 | 0 | 0 | +/- Cin                         |  |
| Hold P                              | 0 | 0 | 0 | 0 | 0 | 1 | 0 | +/- (P + Cin)                   |  |
| A:B Select                          | 0 | 0 | 0 | 0 | 0 | 1 | 1 | +/- (A:B + Cin)                 |  |
| Multiply                            | 0 | 0 | 0 | 0 | 1 | 0 | 1 | +/- (A * B + Cin)               |  |
| C Select                            | 0 | 0 | 0 | 1 | 1 | 0 | 0 | +/- (C + Cin)                   |  |
| Feedback Add                        | 0 | 0 | 0 | 1 | 1 | 1 | 0 | +/- (C + P + Cin)               |  |
| 36-Bit Adder                        | 0 | 0 | 0 | 1 | 1 | 1 | 1 | +/- (A:B + C + Cin)             |  |
| P Cascade Select                    | 0 | 0 | 1 | 0 | 0 | 0 | 0 | PCIN +/- Cin                    |  |
| P Cascade Feedback Add              | 0 | 0 | 1 | 0 | 0 | 1 | 0 | PCIN +/- (P + Cin)              |  |
| P Cascade Add                       | 0 | 0 | 1 | 0 | 0 | 1 | 1 | PCIN +/- (A:B + Cin)            |  |
| P Cascade Multiply Add              | 0 | 0 | 1 | 0 | 1 | 0 | 1 | PCIN +/- (A * B + Cin)          |  |
| P Cascade Add                       | 0 | 0 | 1 | 1 | 1 | 0 | 0 | PCIN +/- (C + Cin)              |  |
| P Cascade Feedback Add Add          | 0 | 0 | 1 | 1 | 1 | 1 | 0 | PCIN +/- (C + P + Cin)          |  |
| P Cascade Add Add                   | 0 | 0 | 1 | 1 | 1 | 1 | 1 | PCIN +/- (A:B + C + Cin)        |  |
| Hold P                              | 0 | 1 | 0 | 0 | 0 | 0 | 0 | P +/- Cin                       |  |
| Double Feedback Add                 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | P +/- (P + Cin)                 |  |
| Feedback Add                        | 0 | 1 | 0 | 0 | 0 | 1 | 1 | P +/- (A:B + Cin)               |  |
| Multiply-Accumulate                 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | P +/- (A * B + Cin)             |  |
| Feedback Add                        | 0 | 1 | 0 | 1 | 1 | 0 | 0 | P +/- (C + Cin)                 |  |
| Double Feedback Add                 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | P +/- (C + P + Cin)             |  |
| Feedback Add Add                    | 0 | 1 | 0 | 1 | 1 | 1 | 1 | P +/- (A:B + C + Cin)           |  |
| C Select                            | 0 | 1 | 1 | 0 | 0 | 0 | 0 | C +/- Cin                       |  |
| Feedback Add                        | 0 | 1 | 1 | 0 | 0 | 1 | 0 | C +/- (P + Cin)                 |  |
| 36-Bit Adder                        | 0 | 1 | 1 | 0 | 0 | 1 | 1 | C +/- (A:B + Cin)               |  |
| Multiply-Add                        | 0 | 1 | 1 | 0 | 1 | 0 | 1 | C +/- (A * B + Cin)             |  |
| 17-Bit Shift P Cascade Select       | 1 | 0 | 1 | 0 | 0 | 0 | 0 | Shift(PCIN) +/- Cin             |  |
| 17-Bit Shift P Cascade Feedback Add | 1 | 0 | 1 | 0 | 0 | 1 | 0 | Shift(PCIN) +/- (P + Cin)       |  |
| 17-Bit Shift P Cascade Add          | 1 | 0 | 1 | 0 | 0 | 1 | 1 | Shift(PCIN) +/- (A:B + Cin)     |  |
| 17-Bit Shift P Cascade Multiply Add | 1 | 0 | 1 | 0 | 1 | 0 | 1 | Shift(PCIN) +/- (A * B + Cin)   |  |
| 17-Bit Shift P Cascade Add          | 1 | 0 | 1 | 1 | 1 | 0 | 0 | Shift(PCIN) +/- (C + Cin)       |  |
| 17-Bit Shift P Cascade Add Add      | 1 | 0 | 1 | 1 | 1 | 1 | 1 | Shift(PCIN) +/- (A:B + C + Cin) |  |
| 17-Bit Shift Feedback               | 1 | 1 | 0 | 0 | 0 | 0 | 0 | Shift(P) +/- Cin                |  |
| 17-Bit Shift Feedback Feedback Add  | 1 | 1 | 0 | 0 | 0 | 1 | 0 | Shift(P) +/- (P + Cin)          |  |
| 17-Bit Shift Feedback Add           | 1 | 1 | 0 | 0 | 0 | 1 | 1 | Shift(P) +/- (A:B + Cin)        |  |
| 17-Bit Shift Feedback Multiply Add  | 1 | 1 | 0 | 0 | 1 | 0 | 1 | Shift(P) +/- (A * B + Cin)      |  |
| 17-Bit Shift Feedback Add           | 1 | 1 | 0 | 1 | 1 | 0 | 0 | Shift(P) +/- (C + Cin)          |  |

- Over 40 Different Modes
- Each XtremeDSP Slice individually controllable
- Change operation in a single clock cycle
- Enables resource sharing for maximum utilization
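The OPMODE table can be read as three multiplexer selects (X, Y, Z) feeding one adder/subtracter. The following is a minimal behavioral sketch in Python, not a model of the real primitive: the function name `dsp48_output`, the `ab` argument (standing in for the A:B concatenation), unbounded integer arithmetic, and the reading of the 17-bit shift as a right shift are all assumptions made for illustration.

```python
def dsp48_output(opmode, a=0, b=0, ab=0, c=0, p=0, pcin=0, cin=0, subtract=False):
    """Return Z +/- (X + Y + Cin) for a 7-bit OPMODE, following the OPMODE table."""
    x_sel = opmode & 0b11          # bits 1:0 select the X input
    y_sel = (opmode >> 2) & 0b11   # bits 3:2 select the Y input
    z_sel = (opmode >> 4) & 0b111  # bits 6:4 select the Z input

    # In the table, the multiplier always appears as X = 01 together with Y = 01.
    if (x_sel == 0b01) != (y_sel == 0b01):
        raise ValueError("the multiplier needs X = 01 and Y = 01 together")

    x_inputs = {0b00: 0, 0b01: a * b, 0b10: p, 0b11: ab}
    y_inputs = {0b00: 0, 0b01: 0, 0b11: c}
    z_inputs = {
        0b000: 0,
        0b001: pcin,
        0b010: p,
        0b011: c,
        0b101: pcin >> 17,  # assumed: 17-bit right shift of PCIN
        0b110: p >> 17,     # assumed: 17-bit right shift of P
    }
    if y_sel not in y_inputs or z_sel not in z_inputs:
        raise ValueError("OPMODE not listed in the table")

    operand = x_inputs[x_sel] + y_inputs[y_sel] + cin
    z = z_inputs[z_sel]
    return z - operand if subtract else z + operand


# Multiply-Accumulate (0100101): P + A * B
print(dsp48_output(0b0100101, a=3, b=4, p=10))  # 22
```

Because the select bits are ordinary inputs, a small control sequencer can step through a different OPMODE on every clock, which is what makes the resource sharing in the bullets above possible.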



#### **Complex Multiplier**

(a+jb).(c+jd) = [a.c - b.d] + j[b.c + a.d]




Prepared by: Niall Battson (Xilinx) 2004



| CLK Cycle | Function            | OPMODE  | Sub | Sel1 | Sel2 | CE_R | CE_I |
|-----------|---------------------|---------|-----|------|------|------|------|
| 1         | Multiply Subtract   | 0001010 | 1   | 0    | 0    | 0    | 1    |
| 2         | Multiply Accumulate | 0101010 | 0   | 1    | 0    | 0    | 0    |
| 3         | Multiply            | 0001010 | 0   | 0    | 1    | 1    | 0    |
| 4         | Multiply Accumulate | 0101010 | 0   | 1    | 1    | 0    | 0    |

**NOTE:** Control signals can be stored in distributed memory.

| Performance | Size                                         |
|-------------|----------------------------------------------|
| 400 MHz     | 1 XtremeDSP Slice, 59 slices (5 for control) |
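The four-cycle schedule can be sketched in software as one multiply-accumulate unit that is reused on every clock. This is only an illustrative model: the function name and the choice of which operand pair is presented in each cycle are assumptions, since the slide gives the functions and control signals but not the operand routing.

```python
def complex_multiply_4cycle(a, b, c, d):
    """Compute (a + jb)(c + jd) with a single multiply-accumulate unit in four cycles."""
    # Each entry: (operand 1, operand 2, subtract product?, accumulate onto P?)
    schedule = [
        (b, d, True, False),   # cycle 1: Multiply Subtract   -> P = -(b*d)
        (a, c, False, True),   # cycle 2: Multiply Accumulate -> P = a*c - b*d (real part)
        (b, c, False, False),  # cycle 3: Multiply            -> P = b*c
        (a, d, False, True),   # cycle 4: Multiply Accumulate -> P = b*c + a*d (imaginary part)
    ]
    p = 0
    captured = []
    for cycle, (x, y, subtract, accumulate) in enumerate(schedule, start=1):
        product = x * y
        base = p if accumulate else 0
        p = base - product if subtract else base + product
        if cycle in (2, 4):  # a result is complete after every second cycle
            captured.append(p)
    real, imag = captured
    return real, imag


print(complex_multiply_4cycle(1, 2, 3, 4))  # (-5, 10), i.e. (1 + 2j) * (3 + 4j)
```

One complex product is produced every four clocks, so a 400 MHz slice would sustain roughly 100 million complex multiplies per second under this schedule.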


### **XtremeDSP Slice Cascade**




#### **DSP48 Slice Power Consumption**



Conditions: typical (TT) process corner, 25 °C, nominal voltage, fully pipelined multiply-add mode, random vectors


#### **DSP48 Power Test for 63 Tap FIR Filter**

#### (Stratix II EP2S60 and Xilinx Virtex-4 XC4VLX60)

| Item                       | Details |
|----------------------------|---------|
| Description                | Test using 63 section asymmetrical taps with an 18-bit data stream and fixed 18-bit coefficients. Virtex-4 uses 63 DSP48 blocks, all in a single column. Stratix II uses 4-tap sections in a DSP block. Reconciling summation of the 4-tap chunks is handled by Stratix II 3-input adders in layers of 6, 2, and 1. Same stimulus VHDL code. |
| Virtex-4 Logic Functions   | 64 DSP48 and 0 slices (1 DSP block used as stimulus of the filter) |
| Stratix II Logic Functions | 128 9-bit DSP elements and 187 ALMs (1/4 of 1 DSP block is used as a stimulus for the filter) |

[Figure: Stratix II EP2S60 vs. Virtex-4 XC4VLX60 total power from V<sub>CCINT</sub>. Power vs. frequency at 85 °C (63-tap FIR filter). Axes: power (W) against frequency (MHz), 0 to 300 MHz.]

More than 1 W of power difference between Virtex-4 and Stratix-II in DSP applications


#### **Measured 1024 Point FFT Power**

(Stratix II EP2S60 and Xilinx Virtex-4 XC4VLX60)



# **Filter Techniques**

The software used for this analysis was:

- ISE 7.1.1i
- Quartus 4.1 SP1
- FIR Compiler 3.2.1




### **The FIR Filter**

The most common DSP function implemented in Xilinx devices is the finite impulse response (FIR) filter.



How do we implement these filters in Virtex-4?
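As a reference point for the architectures that follow, an N-tap FIR filter computes y(n) = Σ h(k) · x(n − k) for k = 0 … N − 1. A minimal Python sketch of that definition (the function name `fir_direct` is illustrative, and samples before the start of the input are assumed to be zero):

```python
def fir_direct(samples, coeffs):
    """Direct-form FIR: y[n] = sum over k of coeffs[k] * samples[n - k]."""
    outputs = []
    for n in range(len(samples)):
        acc = 0
        for k, h in enumerate(coeffs):
            if n - k >= 0:  # samples before the start of the input count as zero
                acc += h * samples[n - k]
        outputs.append(acc)
    return outputs


print(fir_direct([1, 2, 3, 4], [1, 2, 3]))  # [1, 4, 10, 16]
```

Every hardware architecture in this section computes the same sums; they differ only in how many multiplications are performed per clock cycle.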



### **Sequential FIR Filters**



- First, consider FIR filters built around a single multiplier.
- The filter coefficients are processed sequentially.
- The line marks where this architecture can no longer meet performance requirements.
- The line has been raised in Virtex-4 thanks to the higher clock performance of the XtremeDSP Slice.
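A sequential (MAC) FIR can be sketched as a single multiply-accumulate loop that visits one coefficient per clock. The names below are illustrative, and the model assumes exactly one multiply-accumulate per clock with no overhead cycles:

```python
def mac_fir(samples, coeffs):
    """Single-multiplier FIR: one multiply-accumulate per clock, len(coeffs) clocks per output."""
    history = [0] * len(coeffs)  # most recent sample first
    outputs = []
    clock_cycles = 0
    for x in samples:
        history = [x] + history[:-1]
        acc = 0
        for h, s in zip(coeffs, history):
            acc += h * s       # the one shared multiplier and accumulator
            clock_cycles += 1
        outputs.append(acc)
    return outputs, clock_cycles


def max_sample_rate(clock_rate, n_coeffs):
    """Highest sample rate one multiplier can sustain, in the same unit as clock_rate."""
    return clock_rate / n_coeffs


print(mac_fir([1, 2, 3, 4], [1, 2, 3]))  # ([1, 4, 10, 16], 12)
```

The clock count grows with the number of coefficients, so the sustainable sample rate is the clock rate divided by the filter length; a faster multiplier moves that limit upward.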




### **Virtex-4 MAC FIR Filter**

*Filter Specification:* Sampling Frequency = 1.2288 MHz, Coefficients = 366
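As a rough check of why a single multiplier is enough for this specification, assume one multiply-accumulate per clock and no overhead cycles (an assumption, not a figure from the slide). The required clock rate is then:

$$f_{clk} \ge N \cdot f_s = 366 \times 1.2288\ \text{MHz} \approx 449.74\ \text{MHz}$$

This sits below the 500 MHz maximum quoted for the fastest speed grade.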




## **Stratix-II MAC FIR Filter**

*Filter Specification:* Sampling Frequency = 1.2288 MHz, Coefficients = 366




# **Parallel FIR Filters**



- Now consider one multiplier per coefficient.
- The filter coefficients are processed in parallel.
- The line marks where this architecture becomes necessary because fewer than 2 clock cycles are available.
- The line has been raised in Virtex-4 thanks to the higher clock performance of the XtremeDSP Slice.



# Virtex-4 Systolic FIR Filter

*Filter Specification:* Sampling Frequency = 400 MHz, Coefficients = 23
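A systolic FIR places one multiplier per tap and sums the products through a registered adder chain rather than an adder tree. The sketch below is a simplified register-level model in Python, not the actual Virtex-4 implementation: it assumes two data registers and one sum register per tap and ignores the slice's internal pipeline registers. Under those assumptions the output equals the ordinary FIR result delayed by (number of taps − 1) clocks.

```python
def systolic_fir(samples, coeffs):
    """Systolic FIR model: two data delays per tap, one registered adder per tap."""
    n_taps = len(coeffs)
    delay_line = [0] * (2 * n_taps - 1)  # delay_line[j] holds x(n - j)
    sums = [0] * n_taps                  # registered output of each tap's adder
    outputs = []
    for x in samples:
        delay_line = [x] + delay_line[:-1]
        new_sums = []
        for k, h in enumerate(coeffs):
            cascade_in = sums[k - 1] if k > 0 else 0  # previous tap's sum from last clock
            new_sums.append(cascade_in + h * delay_line[2 * k])
        sums = new_sums
        outputs.append(sums[-1])
    return outputs


# Same filter as a direct-form FIR, with the result arriving 2 clocks later (3 taps).
print(systolic_fir([1, 2, 3, 4, 0, 0, 0, 0], [1, 2, 3]))  # [0, 0, 1, 4, 10, 16, 17, 12]
```

Each adder only ever talks to its neighbor, which is why the structure maps onto a column of cascaded slices and keeps its clock rate as taps are added.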






### **Stratix-II Parallel FIR Filter**




# **Semi-Parallel FIR Filters**



- Now consider the in-between scenario: multiple coefficients per multiplier (M).
- The filter coefficients are processed in a semi-parallel fashion.
- The boundary lines are determined by the other two techniques.
- The line has been raised in Virtex-4 thanks to the higher clock performance of the XtremeDSP Slice.



## Virtex-4 4 Multiplier Systolic SP FIR

*Filter Specification:* Sampling Frequency = 74.176 MHz, Coefficients = 16
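The trade-off between multipliers and clock rate for a semi-parallel filter can be estimated with integer arithmetic. This is a back-of-the-envelope sketch with illustrative function names; it assumes every multiplier performs one multiply-accumulate per clock and that there are no overhead cycles.

```python
def multipliers_needed(n_coeffs, sample_rate_hz, clock_hz):
    """Smallest number of multipliers that can keep up with the sample rate."""
    macs_per_second = n_coeffs * sample_rate_hz
    return -(-macs_per_second // clock_hz)  # integer ceiling division


def clock_needed_hz(n_coeffs, sample_rate_hz, n_multipliers):
    """Minimum clock rate when the coefficients are shared across n_multipliers."""
    cycles_per_sample = -(-n_coeffs // n_multipliers)
    return cycles_per_sample * sample_rate_hz


# 16 coefficients at 74.176 MHz on 4 multipliers: 4 coefficients per multiplier.
print(clock_needed_hz(16, 74_176_000, 4))               # 296704000
print(multipliers_needed(16, 74_176_000, 296_704_000))  # 4
```

With one multiplier the same formulas reduce to the sequential case, and with one multiplier per coefficient they reduce to the parallel case.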




# **4 Multiplier Systolic SP FIR**

System Generator Implementation



#### **Stratix-II 4 Multiplier Semi-Parallel FIR**

*Filter Specification:* Sampling Frequency = 74.176 MHz, Coefficients = 16




# Multi-Channel Multi-Rate FIR Filters




#### Virtex-4 Multi-Channel Multi-Rate FIR

Filter Specification:

Input Frequency = 3.84 MHz, Coefficients = 192, Interpolation Rate Change = 2, Channels = 8, Data Width = 12-bit, Coefficient Width = 15-bit



Slicing up the pie:

Total number of coefficients = 1536 (192 coefficients × 8 channels)

The coefficient matrix is 96 × 16: 96 coefficients per phase, and 8 channels × 2 interpolation phases = 16 phases.

#### Option 1:

16 sequential MACC engines

96 clock cycles, clock speed: 368.64 MHz

#### Option 2:

1 semi-parallel 12-multiplier FIR

8 cycles per phase, 16 phases = 128 clock cycles

Clock speed: 491.52 MHz

#### Option 3: Increase coefficients to 196

1 semi-parallel 14-multiplier FIR

7 cycles per phase, 16 phases = 112 clock cycles

Clock speed: 430.08 MHz
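The three options follow one pattern: count the clock cycles needed per input sample, then multiply by the 3.84 MHz input rate. A short Python sketch of that arithmetic (function and parameter names are illustrative, and the model assumes one multiply-accumulate per multiplier per clock with no overhead cycles):

```python
def cycles_per_input_sample(coeffs, interpolation, channels, multipliers_per_engine, engines):
    """Clock cycles each engine spends per input sample period."""
    phases = channels * interpolation          # 8 channels x 2 phases = 16
    coeffs_per_phase = coeffs // interpolation
    cycles_per_phase = -(-coeffs_per_phase // multipliers_per_engine)  # ceiling division
    phases_per_engine = -(-phases // engines)
    return cycles_per_phase * phases_per_engine


def clock_khz(cycles, input_rate_khz=3840):
    """Required clock in kHz for the given cycle count at a 3.84 MHz input rate."""
    return cycles * input_rate_khz


option_1 = cycles_per_input_sample(192, 2, 8, multipliers_per_engine=1, engines=16)
option_2 = cycles_per_input_sample(192, 2, 8, multipliers_per_engine=12, engines=1)
option_3 = cycles_per_input_sample(196, 2, 8, multipliers_per_engine=14, engines=1)

print(option_1, clock_khz(option_1))  # 96 368640
print(option_2, clock_khz(option_2))  # 128 491520
print(option_3, clock_khz(option_3))  # 112 430080
```

Padding the filter to 196 coefficients in Option 3 lets 14 multipliers divide the 98 coefficients per phase evenly, which lowers the required clock relative to Option 2.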




#### Stratix-II Multi-Channel Multi-Rate FIR

Filter Specification:

Input Frequency = 3.84 MHz, Coefficients = 192, Interpolation Rate Change = 2, Channels = 8, Data Width = 12-bit, Coefficient Width = 15-bit




## **Case Study 1: DUC**

DUC Specification: Output Frequency = 450 MSPS

- DDS: SFDR = 84 dB
- CIC: 5 stage, interpolation rate = 1:16
- CFIR: 32 coefficients, interpolation rate = 1:2
- PFIR: 64 coefficients, interpolation rate = 1:2
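The interpolation rates fix the sample rate at every point in the chain. The sketch below works backward from the 450 MSPS output; the stage ordering PFIR, then CFIR, then CIC is the conventional arrangement for a digital up converter and is assumed here rather than taken from the specification.

```python
def stage_input_rates(output_rate_msps, stages):
    """Return the input sample rate of each stage, given stages in input-to-output order."""
    rates = {}
    rate = output_rate_msps
    for name, interpolation in reversed(stages):
        rate = rate / interpolation  # each stage's input is slower by its interpolation factor
        rates[name] = rate
    return rates


# Assumed ordering: PFIR (1:2) -> CFIR (1:2) -> CIC (1:16) -> 450 MSPS output
duc_stages = [("PFIR", 2), ("CFIR", 2), ("CIC", 16)]
rates = stage_input_rates(450.0, duc_stages)
print(rates)  # {'CIC': 28.125, 'CFIR': 14.0625, 'PFIR': 7.03125}
```

The overall interpolation is 2 × 2 × 16 = 64, so the two FIR stages run at low sample rates and have many clock cycles per sample available for sharing multipliers.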





# **DUC: Sysgen Model**




## **Case Study: DUC**



| DUC Size    | Virtex-II Pro          | Virtex-4            |
|-------------|------------------------|---------------------|
| Multipliers | 6 embedded multipliers | 27 XtremeDSP Slices |
| Flip-flops  | 2,328                  | 692                 |
| LUTs        | 2,076                  | 977                 |
| Block RAM   | 10                     | 10                  |
| Performance | 202 MHz                | >400 MHz            |



### Case Study 2: 2-D FIR

2-D FIR Specification:

- Frame rate = 60 Hz
- Active frame size = 1440 x 1080
- Single-channel separable FIRs: sample rate = 111.38 MSPS, 24-tap re-loadable, 10-bit data, folding factor = 4

$$y(n,m) = \sum_{k=-N}^{N} h_0(k) \left\{ \sum_{l=-N}^{N} h_1(l) \cdot x(n-k,m-l) \right\}$$
$$= \sum_{l=-N}^{N} h_1(l) \left\{ \sum_{k=-N}^{N} h_0(k) \cdot x(n-k,m-l) \right\}$$
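Because the 2-D kernel is separable, the filter can be run as two 1-D passes: one along the second index with h1, then one along the first index with h0. A minimal Python sketch of that idea follows; the function name, list-of-lists image layout, and zero padding outside the frame are assumptions for illustration.

```python
def fir_2d_separable(image, h0, h1):
    """Separable 2-D FIR with centered taps; h0 runs along n (first index), h1 along m."""
    rows, cols = len(image), len(image[0])
    n0, n1 = len(h0) // 2, len(h1) // 2  # taps are indexed from -N to +N

    def pixel(n, m):
        # Pixels outside the active frame are treated as zero.
        return image[n][m] if 0 <= n < rows and 0 <= m < cols else 0

    # Inner sum: 1-D filter along m using h1.
    inner = [[sum(h1[l + n1] * pixel(n, m - l) for l in range(-n1, n1 + 1))
              for m in range(cols)] for n in range(rows)]

    def inner_at(n, m):
        return inner[n][m] if 0 <= n < rows else 0

    # Outer sum: 1-D filter along n using h0, applied to the inner result.
    return [[sum(h0[k + n0] * inner_at(n - k, m) for k in range(-n0, n0 + 1))
             for m in range(cols)] for n in range(rows)]


# A single bright pixel reproduces the outer product of the two coefficient sets.
impulse = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
print(fir_2d_separable(impulse, [1, 2, 3], [4, 5, 6]))
# [[4, 5, 6], [8, 10, 12], [12, 15, 18]]
```

Two passes of 2N + 1 taps each replace a single (2N + 1) × (2N + 1) kernel, which is why the separable form needs far fewer multipliers.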






## Case Study 2: Vertical FIR & Line Buffer




## **Case Study: 2-D FIR**



| 2-D FIR Size | Virtex-II Pro           | Virtex-4            |
|--------------|-------------------------|---------------------|
| Multipliers  | 12 embedded multipliers | 15 XtremeDSP Slices |
| Flip-flops   | 1,325                   | 560                 |
| LUTs         | 890                     | 414                 |
| Block RAM    | 30                      | 30                  |
| Performance  | 229.8 MHz               | 446 MHz             |





# **Conclusion - The Check List**

Is the design running as fast as possible? (500 MHz in the fastest speed grade, 50% faster than Stratix-II. Resources can be saved by making sure the design runs at full speed.)

Is the XtremeDSP Slice being fully utilized? (Fabric slices can be saved by better exploiting the XtremeDSP Slice, which also reduces power: more than 1 W less than Stratix-II.)

Are adder chains being used instead of trees? (The XtremeDSP Slice is designed to support adder cascades.)



# **Knowledge is Power**

"The next best thing to knowing something is knowing where to find it" - Samuel Johnson



**5 Application Notes** available in the **Virtex-4 User Guide** in regard to implementation specifics

Many reference designs are available in VHDL, Verilog, and System Generator for DSP.

For further details, visit:

#### www.xilinx.com/dsp



# **Knowledge is Power**

"The desire of knowledge, like the thirst of riches, increases ever with



