# FPGAConvNet & SAMO: Model-Specific Optimisation of Convolutional Neural Network Accelerators onto FPGAs

**Alexander Montgomerie-Corcoran** 

Intelligent Digital Systems Lab

**Dept. of Electrical and Electronic Engineering** 

www.imperial.ac.uk/idsl


## Context

## Where are CNN Models deployed?



- **Datacenter:** High Throughput
- **Edge:** Low Latency



## **Motivation**






## **FPGAConvNet:**

- FPGA-based Accelerator
- Streaming Architecture
- Model-specific
- Automated Compilation

github.com/AlexMontgomerie/fpgaconvnet-hls

github.com/AlexMontgomerie/fpgaconvnet-model


## SAMO:

- Design-Space Exploration
- Streaming-Architecture Specific
- Backend Agnostic (FINN, HLS4ML, FPGAConvNet)



github.com/AlexMontgomerie/samo

# Background

- Systolic Array Architectures
- Streaming Architectures


## **Background: CNN Accelerators**

**Systolic Array:** 



- General Purpose
- Matrix Multiplication Engines
- Single design used across all layers of the CNN Model
- Small design space
- Requires data reordering (im2col)

(EYERISS, TPU, ...)
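As a rough illustration of the data reordering step, the sketch below rearranges a feature map into an im2col matrix so that convolution becomes a single matrix multiplication. It assumes a channel-first layout, stride 1 and no padding. The function names and shapes are illustrative and are not taken from any of the accelerators named above.

```python
import numpy as np

def im2col(feature_map, k):
    """Rearrange a (C, H, W) feature map into a (C*k*k, out_h*out_w) matrix.

    Assumes stride 1 and no padding (illustrative choices).
    """
    c, h, w = feature_map.shape
    out_h, out_w = h - k + 1, w - k + 1
    cols = np.empty((c * k * k, out_h * out_w), dtype=feature_map.dtype)
    col = 0
    for y in range(out_h):
        for x in range(out_w):
            # Each column is one flattened k x k window over all channels.
            cols[:, col] = feature_map[:, y:y + k, x:x + k].reshape(-1)
            col += 1
    return cols

def conv_as_matmul(feature_map, weights):
    """Convolution as one matrix multiply; weights have shape (F, C, k, k)."""
    f, c, k, _ = weights.shape
    cols = im2col(feature_map, k)
    out = weights.reshape(f, c * k * k) @ cols
    out_h = feature_map.shape[1] - k + 1
    return out.reshape(f, out_h, -1)
```

The reordered matrix duplicates every input pixel up to k x k times, which is the overhead that a streaming sliding window avoids.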

## **Background: CNN Accelerators**

### **Streaming Architecture:**



- Hardware is **customized** for a specific CNN Model
- All layers of the CNN Model are **pipelined** together
- Large Design Space

(FINN, HLS4ML, FPGAConvNet)

# **FPGAConvNet**

- Hierarchy
- Layers
- Modules
- Performance Parameters
- Modelling


## **Hierarchy**

# Convolution



# **Sliding Window**

- Produces consecutive kernel-sized windows
- Requires no data re-ordering
- Fully pipelined

$$O(f, x, y) = \sum_{c=0}^{C} \sum_{k_x=0}^{K} \sum_{k_y=0}^{K} I(c, x + k_x, y + k_y) \cdot W(c, k_x, k_y, f)$$
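To make the streaming behaviour concrete, the following is a minimal software sketch of a sliding window generator: pixels are consumed in raster order and a window is emitted as soon as enough rows have been buffered, so no reordering pass is needed. It handles a single channel with stride 1 and no padding. The name `sliding_window` and the line-buffer layout are assumptions for illustration, not the FPGAConvNet hardware module.

```python
from collections import deque

def sliding_window(pixels, width, k):
    """Yield k x k windows from a raster-order pixel stream.

    pixels: iterable of values for one channel, row by row.
    width:  number of pixels per row.
    Only the k-1 most recent complete rows are buffered, as a software
    stand-in for hardware line buffers.
    """
    rows = deque(maxlen=k - 1)   # oldest row is dropped automatically
    current = []
    for p in pixels:
        current.append(p)
        x = len(current) - 1
        # With k-1 full rows buffered and at least k pixels in the current
        # row, the window ending at this pixel is complete.
        if len(rows) == k - 1 and x >= k - 1:
            window = [r[x - k + 1:x + 1] for r in rows]
            window.append(current[x - k + 1:x + 1])
            yield window
        if len(current) == width:
            rows.append(current)
            current = []
```

For a 4 x 4 input and a 3 x 3 kernel, the generator emits four windows, one for each valid output position.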



## **Modules: Sliding Window**










# **Vector Dot Product**

- Accepts windows of feature-map and weights
- Performs a Vector Dot Product on these flattened windows
- Fully pipelined

$$O(f, x, y) = \sum_{c=0}^{C} \sum_{k_x=0}^{K} \sum_{k_y=0}^{K} I(c, x + k_x, y + k_y) \cdot W(c, k_x, k_y, f)$$



## **Modules: Accumulate**

# Accumulate

- Accumulates across the **channel** dimension
- Fully pipelined



$$O(f, x, y) = \sum_{c=0}^{C} \sum_{k_x=0}^{K} \sum_{k_y=0}^{K} I(c, x + k_x, y + k_y) \cdot W(c, k_x, k_y, f)$$
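The split into a vector dot product followed by an accumulation can be checked in software. In the sketch below, `vector_dot` multiplies one flattened window by one flattened kernel for a single channel, and `accumulate` sums those partial results across the channel dimension. Both names and the array layouts (`feature_map[c, x, y]`, `weights[c, kx, ky, f]`) are assumptions chosen to mirror the indices of the equation.

```python
import numpy as np

def vector_dot(window, kernel):
    """Dot product of one flattened K x K window with one flattened kernel."""
    return float(np.dot(window.reshape(-1), kernel.reshape(-1)))

def accumulate(partials):
    """Sum per-channel partial results into one output value."""
    total = 0.0
    for p in partials:
        total += p
    return total

def conv_output(feature_map, weights, f, x, y):
    """O(f, x, y) built from the two modules above.

    feature_map has shape (C, H, W) and is indexed as feature_map[c, x, y];
    weights has shape (C, K, K, F) and is indexed as weights[c, kx, ky, f].
    """
    channels, k = weights.shape[0], weights.shape[1]
    partials = (vector_dot(feature_map[c, x:x + k, y:y + k], weights[c, :, :, f])
                for c in range(channels))
    return accumulate(partials)
```

Keeping the two steps separate mirrors the module split: the dot product handles the kernel dimensions, and the accumulator handles the channel dimension.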

## **Layers: Convolution**



# **Parametrisation**

## How do we improve performance?

- Vector Dot Product Folding
- Input Channel Folding
- Output Channel Folding
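The effect of the three parameters can be illustrated with a simple cycle-count estimate. The sketch below assumes that each parameter must divide its dimension evenly and that the layer then needs proportionally fewer cycles. The names `streams_in`, `streams_out` and `dot_parallelism`, and the cost formula itself, are illustrative assumptions rather than the exact FPGAConvNet model.

```python
def divisors(n):
    """All factors of n: the candidate parameter values for a dimension of size n."""
    return [d for d in range(1, n + 1) if n % d == 0]

def conv_cycles(height, width, channels, filters, k,
                streams_in=1, streams_out=1, dot_parallelism=1):
    """Estimated cycles for one convolution layer (assumed cost model).

    streams_in / streams_out: parallel input / output channel streams.
    dot_parallelism: multiplications of the K*K dot product done per cycle.
    """
    if (channels % streams_in or filters % streams_out
            or (k * k) % dot_parallelism):
        raise ValueError("each parameter must divide its dimension")
    return (height * width
            * (channels // streams_in)
            * (filters // streams_out)
            * (k * k // dot_parallelism))

# Hypothetical 32 x 32 layer with 64 input channels, 128 filters, 3 x 3 kernel.
print(conv_cycles(32, 32, 64, 128, 3))              # 75497472
print(conv_cycles(32, 32, 64, 128, 3, 4, 8, 9))     # 262144
```

Under this model, every increase in parallelism trades cycles for hardware, which is what makes the design space large.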







# **Channel Folding**

# Input Channel Folding



# **Output Channel Folding**



# Modelling

- Need high-level models for **Design Space Exploration**
- Avoids **Synthesis**, which can make DSE intractable
- Models **Performance** for the objective and **Resources** for the platform constraints

## **Modelling: Performance**

- Model performance based on Synchronous Dataflow (SDF) Graph model
- For all Acyclic SDF Graphs, the performance is dictated by the slowest node
- This assumes **back-pressure** and sufficient **buffer sizes** between layers
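Under these assumptions, the performance estimate reduces to finding the slowest node of the pipeline. The sketch below takes hypothetical per-layer cycle counts and a clock frequency and reports the bottleneck and the resulting throughput. The layer names, the numbers and the `pipeline_throughput` helper are invented for illustration.

```python
def pipeline_throughput(cycles_per_layer, clock_hz):
    """Return (bottleneck layer, frames per second) for a pipelined design.

    cycles_per_layer: dict mapping layer name -> cycles to process one frame.
    With back-pressure and sufficient buffering, the steady-state interval
    between frames equals the cycle count of the slowest layer.
    """
    bottleneck = max(cycles_per_layer, key=cycles_per_layer.get)
    interval = cycles_per_layer[bottleneck]
    return bottleneck, clock_hz / interval

# Hypothetical cycle counts, for illustration only.
layers = {"conv1": 200_000, "pool1": 50_000, "conv2": 400_000, "fc": 10_000}
name, fps = pipeline_throughput(layers, clock_hz=200e6)
print(name, fps)  # conv2 500.0
```

Speeding up any layer other than the bottleneck leaves the throughput unchanged, so an optimiser only gains by adding parallelism to the slowest node.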



## **Modelling: Resources**

## **DSP** and **BRAM**:

- Deterministic models

## LUT and FF:

- **Regression** models
- Variations in **P&R** as well as **HLS**



# **SAMO Framework**




## **SAMO: Framework (parser)**



- Maps the **CNN Model** graph to the **Hardware Description** graph
- First converts the **CNN Model** to the Streaming Architecture's representation
  - A wrapper then maps this custom representation to the abstract **Hardware Description** graph



## **SAMO: Hardware Description Graph**

**CNN Model:** a model with *L* layers can be described as a graph *M* with edges $E_M$.

**Hardware Description Graph:** a graph *H* with *N* nodes and edges $E_H$.

CNN Model -> HD Graph:

$$S(M, E_M) \mapsto (H, E_H)$$

Backends: FINN, HLS4ML, FPGAConvNet

Frontends: ONNX, Keras, Tensorflow



## **SAMO: Framework (optimiser)**



- Solves the optimisation problem on the Hardware Description graph
- Two optimisers are currently implemented:
  - Rule Based
  - Simulated Annealing
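As an illustration of the second optimiser, the following is a generic simulated annealing loop over per-node parallelism values. It is a textbook sketch, not SAMO's implementation: the cost function (latency of the slowest node), the move (re-pick one node's value from the divisors of its channel count), the `budget` resource limit and the cooling schedule are all assumptions.

```python
import math
import random

def anneal(channels, budget, steps=2000, t_start=1.0, t_end=1e-3, seed=0):
    """Pick a parallelism value per node to minimise the slowest node's latency.

    channels: list of channel counts, one per node (latency = channels / value).
    budget:   maximum allowed sum of values (a stand-in for resources).
    """
    if budget < len(channels):
        raise ValueError("budget too small for the minimum design")
    rng = random.Random(seed)
    options = [[d for d in range(1, c + 1) if c % d == 0] for c in channels]

    def cost(values):
        return max(c / v for c, v in zip(channels, values))

    current = [1] * len(channels)      # minimum parallelism, always feasible
    best = list(current)
    for step in range(steps):
        temp = t_start * (t_end / t_start) ** (step / steps)  # geometric cooling
        candidate = list(current)
        i = rng.randrange(len(channels))
        candidate[i] = rng.choice(options[i])
        if sum(candidate) > budget:
            continue                   # reject moves that break the constraint
        delta = cost(candidate) - cost(current)
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current = candidate
            if cost(current) < cost(best):
                best = list(current)
    return best, cost(best)
```

A rule-based optimiser would instead apply a fixed sequence of moves, for example repeatedly increasing the parallelism of the slowest node until a constraint is hit, which makes its behaviour deterministic.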

## **SAMO: Performance Parameters**

**Input Channel Folding:** parallelism across the channel dimension of the incoming feature map

**Output Channel Folding:** parallelism across the channel dimension of the outgoing feature map




# **SAMO:** Partitioning




# **SAMO: Constraints**

#### System:

Constraints on the system, relating to the FPGA and Off-Chip memory.

- Resources (LUT, DSP, BRAM, FF)
- Memory Bandwidth

#### **Structure:**

Constraints on the performance variables to ensure functionality.

- Inter folding matching
- Intra folding matching
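A feasibility check for these constraints might look like the sketch below. It assumes a linear chain of nodes, that inter folding matching means the output folding of a node equals the input folding of the next node, and that intra folding matching means input and output folding are equal within nodes flagged `match_io`. These interpretations, the `Node` fields and the resource dictionaries are assumptions, not SAMO's data structures.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    fold_in: int
    fold_out: int
    match_io: bool = False                          # intra matching required (assumed flag)
    resources: dict = field(default_factory=dict)   # e.g. {"DSP": 64}

def is_feasible(nodes, platform, bandwidth_used=0.0, bandwidth_limit=float("inf")):
    """Check system and structure constraints for a chain of nodes."""
    # System: summed resources must fit the platform, per resource type.
    for res, limit in platform.items():
        if sum(n.resources.get(res, 0) for n in nodes) > limit:
            return False
    # System: off-chip memory bandwidth must not be exceeded.
    if bandwidth_used > bandwidth_limit:
        return False
    # Structure: intra folding matching within flagged nodes.
    if any(n.match_io and n.fold_in != n.fold_out for n in nodes):
        return False
    # Structure: inter folding matching between consecutive nodes.
    return all(a.fold_out == b.fold_in for a, b in zip(nodes, nodes[1:]))
```

An optimiser can call such a check on every candidate design and discard the ones that fail before evaluating the objective.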





# **SAMO: Objective**

Latency of Partition = Latency of Slowest Layer

Latency of Network = sum(Latency of Partitions) + No. Partitions × Reconfiguration Time
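The two latency expressions above translate directly into code. In the sketch below, the per-layer latencies, the grouping into partitions and the reconfiguration time are hypothetical inputs, and the helper names are illustrative.

```python
def partition_latency(layer_latencies):
    """Latency of a partition is that of its slowest layer."""
    return max(layer_latencies)

def network_latency(partitions, reconfiguration_time):
    """Sum of partition latencies plus one reconfiguration per partition."""
    return (sum(partition_latency(p) for p in partitions)
            + len(partitions) * reconfiguration_time)

# Hypothetical latencies in milliseconds, for illustration only.
partitions = [[2.0, 5.0, 1.0], [3.0, 4.0]]
print(network_latency(partitions, reconfiguration_time=10.0))  # 29.0
```

Splitting a network into more partitions frees resources for each one but adds a reconfiguration cost per partition, so the number of partitions is itself part of the trade-off.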

#### **Objectives:**

- Throughput
- Latency





#### **SAMO: Design Space Exploration**



(Iterations of Simulated Annealing)


# **SAMO: Framework (exporter)**



- The Hardware Description graph is converted back to the Streaming Architecture's representation, with the tuned parameters
- This can then be used to generate a bitstream following the Streaming Architecture's design flow

# **Evaluation**

- Design Space Exploration
- SAMO Comparison
- FPGAConvNet Performance



#### **Evaluation: Design Space Exploration**



CNV on a U250 using FINN

- **Rule-Based:** deterministic behaviour and exploration time
- **Simulated Annealing:** stochastic behaviour and exploration time
- Comparable quality of designs
- Runtime is short enough to be usable within a NAS (neural architecture search) context

## **Evaluation: SAMO Comparison**



- Comparison to hand-tuned designs
- **Never below** hand-tuned performance
- Found between **4x and 20x improvement**
- Free performance!

# **Evaluation: FPGAConvNet Performance**

**FINN:** Streaming Architecture

**Wei et al.:** Systolic Array

- Comparison across range of networks
- Accuracy vs Performance trade-off
- FINN is heavily quantised

Evaluated on:

- U250 for FINN and FPGAConvNet
- Arria 10 for Wei et al.



Xuechao Wei et al. Automated Systolic Array Architecture Synthesis for High Throughput CNN Inference on FPGAs. In Proceedings of the 54th Annual Design Automation Conference (DAC '17), 2017.




Conclusion

# Thank you for listening!



AlexMontgomerie/samo

AlexMontgomerie/fpgaconvnet-hls

AlexMontgomerie/fpgaconvnet-model



am9215@ic.ac.uk
