Nicholas Fraser, AMD Xilinx
LogicNets: Co-Designed Neural Networks and Circuits for Extreme-Throughput Applications
Aaron Zhao, Imperial College London, U.K.
On the Opportunities and Challenges of Hardware-aware Automated Machine Learning
Gary Robinson, Groq
Running Scalable Applications on Groq’s AI / HPC Platform
Alexander Montgomerie-Corcoran, Imperial College London, U.K.
fpgaConvNet & SAMO: Model-Specific Optimisation of Convolutional Neural Network Accelerators onto Field Programmable Gate Arrays
Partha Maji, Tenstorrent
Implementing GNNs in modern spatial accelerators - challenges and opportunities
9:05-9:40 : Nicholas Fraser, AMD Xilinx
Title: LogicNets: Co-Designed Neural Networks and Circuits for Extreme-Throughput Applications
Abstract: Machine learning algorithms have been gradually displacing traditional programming techniques across multiple domains, including domains that require extremely high data rates, such as telecommunications and network packet filtering. Although high accuracy has been demonstrated, very few works have shown how to run these algorithms under such high-throughput constraints. To address this, we propose LogicNets, a co-design method that constructs a neural network and its inference engine at the same time. We create a set of layer primitives called neuron equivalent circuits (NEQs) which map neural network layers directly to the hardware building blocks (HBBs) available on an FPGA. From this, we can design and execute networks with low activation bitwidth and high sparsity at extremely high data rates and low latency, while using only a small amount of FPGA resources.
Bio: Nicholas J. Fraser received his PhD from The University of Sydney, Australia, in 2020. He is currently a senior research scientist at AMD AECG Research Labs, Dublin, Ireland. His main research interests include training of reduced-precision neural networks, software/hardware co-design of neural network topologies and accelerators, and audio signal processing.
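The NEQ mapping described in the abstract relies on the observation that a neuron with few, low-bitwidth inputs can be enumerated exhaustively into a truth table, which maps directly onto FPGA lookup tables. A minimal Python sketch of that enumeration (the weights, bitwidths, and clamped-ReLU activation below are illustrative assumptions, not the LogicNets implementation):

```python
import itertools

def neuron_to_lut(weights, bias, in_bits=2, out_bits=2):
    """Enumerate every quantized input combination of a sparse neuron and
    precompute its output, yielding a truth table that can be burned into
    an FPGA lookup table. Illustrative sketch only."""
    levels = 2 ** in_bits
    out_levels = 2 ** out_bits
    lut = {}
    for inputs in itertools.product(range(levels), repeat=len(weights)):
        acc = sum(w * x for w, x in zip(weights, inputs)) + bias
        act = max(acc, 0)               # ReLU
        act = min(act, out_levels - 1)  # clamp to the output bitwidth
        lut[inputs] = act
    return lut

# 3 fan-in connections x 2 bits each -> a 6-input LUT with 4^3 = 64 entries.
table = neuron_to_lut(weights=[1, -2, 1], bias=0, in_bits=2, out_bits=2)
print(len(table))  # 64
```

Because the table size grows exponentially with fan-in times input bitwidth, the approach depends on the high sparsity and low activation bitwidth mentioned in the abstract.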
9:40-10:15 : Yiren (Aaron) Zhao, Imperial College London
Title: On the Opportunities and Challenges of Hardware-aware Automated Machine Learning
Abstract: DNNs are becoming increasingly common in production applications. Once a network starts to handle large amounts of data, efficiency becomes the major concern. Automated Machine Learning, and Neural Architecture Search in particular, focuses on finding the optimal network topology for the target dataset, and recent studies have shown that these methods can be extended to incorporate hardware awareness, e.g. taking hardware-specific latency as the optimization target. In this talk I will show how Automated Machine Learning (AutoML) methods can discover network topologies with both better runtime performance and better accuracy for emerging network types and new learning scenarios. I will mainly discuss the research opportunities and challenges in building the next generation of AutoML methods.
Bio: Aaron Zhao is a lecturer at Imperial College London and a visiting researcher at the University of Cambridge. He works at the intersection of hardware, security, and machine learning. He was educated at Imperial College London (BEng in EEE) and Cambridge (MPhil and PhD in Computer Science). He received a Junior Research Fellowship (JRF) at St John’s College, Cambridge in 2021, and an Apple Scholars in AI/ML award in 2020.
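Taking hardware-specific latency as the optimization target, as the abstract describes, typically means folding a measured latency into the search's reward so candidates are ranked by accuracy and speed together. A simplified latency-penalized reward, loosely in the spirit of published hardware-aware NAS objectives (all names and numbers below are illustrative assumptions):

```python
def hw_aware_score(accuracy, latency_ms, target_ms=5.0, alpha=0.5):
    """Rank a candidate architecture by accuracy scaled with a soft
    latency penalty: candidates slower than the target are discounted,
    candidates within budget keep their raw accuracy. Illustrative only."""
    penalty = (latency_ms / target_ms) ** -alpha if latency_ms > target_ms else 1.0
    return accuracy * penalty

# Hypothetical candidates: a more accurate but slow net vs. a fast one.
candidates = [
    {"name": "net_a", "acc": 0.92, "lat": 8.0},
    {"name": "net_b", "acc": 0.90, "lat": 4.0},
]
best = max(candidates, key=lambda c: hw_aware_score(c["acc"], c["lat"]))
print(best["name"])  # net_b: within the latency budget, so it wins
```

The key design choice is that latency comes from measurement (or a learned predictor) on the actual target device rather than from proxy metrics such as FLOPs.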
10:45-11:20 : Gary Robinson, Groq
Title: Running Scalable Applications on Groq’s AI / HPC Platform
Abstract: GroqChip™, based on a Tensor Streaming Processor architecture, is an accelerator chip that offers massive compute performance for AI-based applications formulated as tensor operations, as well as linear algebra in general. Its deterministic dataflow execution model results in predictable performance without runtime variation, while RealScale™ chip-to-chip interconnect technology makes it possible to achieve near-linear scaling from card to node to rack. In this talk, we first describe Groq IO Accelerator, an FPGA-based system to extend the capabilities of GroqChip with application-specific accelerator blocks. A real-time image classification application is mapped to the GroqChip, while the FPGA-based IO Accelerator handles decoding and preprocessing of image data from a real-time video stream. Information exchange via RealScale chip-to-chip interconnect ensures that communication between GroqChip and Groq IO Accelerator does not become the bottleneck. Furthermore, we describe how GroqChip, originally geared towards AI applications, can also deliver very high performance for linear algebra-based applications in HPC. A seismic imaging application, based on the finite difference method, is mapped to the Groq architecture. The original stencil-based algorithm is transformed into tensor operations that efficiently leverage the computational power of GroqChip. The deterministic dataflow model supports efficient orchestration of data movements within the chip and between chips without ever stalling GroqChip compute units. Finally, numerical analysis and optimization allow efficient leverage of Groq TruePoint™ arithmetic to satisfy the numerical requirements of seismic imaging.
Bio: Gary Robinson is a Fellow at Maxeler Technologies. Before that he was head of Maxeler's Analytics team, responsible for developing Maxeler's real-time risk solutions, including real-time XVA and Value at Risk (VaR) running on FPGA, and a library for pricing and calculating risk for financial derivatives. Gary has over 13 years of experience developing applications on FPGAs at Maxeler in areas such as seismic image processing and finance, including the award-winning project with JP Morgan which reduced the time taken for their structured credit derivatives pricing and risk calculations from 8 hours to under 2 minutes. Gary holds an MEng in Software Engineering from Imperial College.
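The transformation the abstract mentions, from a stencil-based finite-difference sweep into tensor operations, can be pictured by recasting a 3-point second-difference stencil as a banded matrix-vector product, i.e. the kind of dense linear-algebra primitive a tensor accelerator runs natively. A small pure-Python sketch under that interpretation (not Groq's actual mapping):

```python
def stencil_as_matmul(u, coeffs=(1.0, -2.0, 1.0)):
    """Apply a 3-point finite-difference stencil by building the equivalent
    banded matrix A and computing the matrix-vector product A @ u.
    Default coefficients are the standard second-difference stencil."""
    n = len(u)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j, c in enumerate(coeffs):
            k = i + j - 1                 # neighbour offsets -1, 0, +1
            if 0 <= k < n:
                A[i][k] = c
    # Dense matrix-vector product: the tensor-op form of the stencil sweep.
    return [sum(A[i][k] * u[k] for k in range(n)) for i in range(n)]

u = [0.0, 1.0, 4.0, 9.0, 16.0]            # u = x^2 sampled on a unit grid
result = stencil_as_matmul(u)
print(result)                              # interior entries equal u'' = 2
```

In practice the matrix would be kept in a blocked or banded tensor layout rather than dense, but the point stands: once the stencil is a matrix product, the deterministic dataflow engine can schedule it without runtime control flow.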
11:20-11:55 : Alexander Montgomerie-Corcoran, Imperial College London
Title: fpgaConvNet & SAMO: Model-Specific Optimisation of Convolutional Neural Network Accelerators onto Field Programmable Gate Arrays
Abstract: Field Programmable Gate Array (FPGA) devices are a promising platform for deploying Convolutional Neural Network (CNN) models in a wide range of settings, from embedded systems to datacenters. However, the large design space associated with an FPGA and CNN pair makes it difficult to find an optimal mapping for a specific model onto a specific FPGA. To address this problem, first we will introduce a set of fundamental hardware building blocks with adjustable performance parameters that can be used to construct a model-specific CNN accelerator. We will then discuss how the performance and resource models of these building blocks can be created to facilitate rapid design space exploration. Finally, we will describe how heuristic-based solvers can be used alongside the hardware models and platform constraints to find optimal mappings for FPGA-CNN pairs, as well as the automated design process of deploying the optimised model onto the chosen FPGA device.
Bio: Alexander Montgomerie-Corcoran received his MEng from Imperial College London in 2019 and is currently pursuing a PhD under the supervision of Dr. Christos-Savvas Bouganis at Imperial College London within the Circuits and Systems group. His research interests include FPGA acceleration of Convolutional Neural Networks and low-power methods for FPGAs.
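The workflow the abstract outlines, i.e. analytical performance and resource models plus a solver searching over adjustable parameters under platform constraints, can be sketched as a toy design-space exploration for a single convolution layer. The parallelism factors, cost models, and budget below are illustrative assumptions, not fpgaConvNet's actual cost functions:

```python
import itertools

def explore(dsp_budget=512):
    """Sweep coarse- and fine-grain parallelism factors for one conv layer,
    estimate cycle count and DSP usage with simple analytical models, and
    keep the fastest design point that fits the resource budget."""
    ops = 1_000_000                        # assumed MACs in the layer
    best = None
    for coarse, fine in itertools.product([1, 2, 4, 8], [1, 3, 9]):
        dsp = coarse * fine * 5            # resource model: DSPs per MAC unit
        if dsp > dsp_budget:
            continue                       # violates the platform constraint
        cycles = ops / (coarse * fine)     # performance model: ideal pipelining
        if best is None or cycles < best["cycles"]:
            best = {"coarse": coarse, "fine": fine,
                    "dsp": dsp, "cycles": cycles}
    return best

print(explore())
```

A real flow replaces the exhaustive sweep with the heuristic solvers mentioned in the abstract, since the joint space over all layers of a network is far too large to enumerate.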
11:55-12:30 : Partha Maji, Tenstorrent
Title: Implementing GNNs in modern spatial accelerators - challenges and opportunities
Abstract: Graph Neural Networks are becoming an important workload in many domains, including recommender systems, molecular drug discovery, the study of brains, transportation networks, scene understanding, and NLP. However, the GNN workload is very sparse, unlike other types of neural networks such as convolutional neural networks and transformers. GNNs do not allow easy exploitation of memory locality, so accelerating GNN workloads tends to be very challenging in hardware. In this talk, we will explore various challenges in commonly used GNN baseline architectures and explore opportunities to accelerate GNNs on emerging accelerators.
Bio: Partha Maji is a Technical Director of ML at Tenstorrent, where he drives research in hardware-efficient Graph Neural Network implementations. Before that he was head of ML research at Arm UK. Partha has over 16 years of industry experience spanning multiple interdisciplinary domains, from mobile processors and on-chip interconnect design to leading low-power SoC teams for set-top boxes and televisions. Partha holds a PhD in Computer Science from the University of Cambridge.
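The locality problem the abstract raises comes from the aggregation step of message passing: each node gathers features from an arbitrary, data-dependent neighbour set rather than from a contiguous block of memory. A minimal mean-aggregation sketch (the graph and features are made up for illustration):

```python
def gnn_aggregate(features, edges):
    """One mean-aggregation step of message passing on a sparse graph.
    The per-node gather follows edge lists, so the memory-access pattern
    is irregular and data-dependent - the property that makes GNNs hard
    to accelerate compared with dense CNN or transformer layers."""
    n, dim = len(features), len(features[0])
    out = []
    for v in range(n):
        nbrs = [u for u, w in edges if w == v]   # irregular gather
        if not nbrs:
            out.append(list(features[v]))        # no messages: keep own state
            continue
        out.append([sum(features[u][d] for u in nbrs) / len(nbrs)
                    for d in range(dim)])
    return out

feats = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
edges = [(0, 2), (1, 2), (2, 0)]                 # directed edges (src, dst)
print(gnn_aggregate(feats, edges))
```

On a spatial accelerator this gather either has to be scheduled as sparse matrix operations or reordered into denser blocks, which is where the challenges and opportunities discussed in the talk arise.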