
Platform one

Image processing

Modular accelerators

A new generation of neural network accelerators (NNAs) is using a modular approach with smaller processing cores (writes Nick Flaherty).

Imagination Technologies has launched a scalable NNA design (also called an intellectual property, or IP, block) optimised for system-on-chip (SoC) designs for autonomous systems, while developer AImotive is also looking at a similar architecture.

The Series4 NNA cores from Imagination have been optimised for the YOLOv3 neural network, which processes large rectangular images, rather than as a general-purpose execution unit. They are aimed at developers of SoC devices for sensor fusion in high-performance Level 4 and 5 autonomous vehicles, last-mile delivery vehicles and automated street sweepers.

The basic building block, called MC1, achieves 12.5 TOPS of performance through 4096 multiply-accumulate (MAC) units connected by a 256-bit network-on-chip (NoC). Up to eight cores can be combined in a low-latency cluster via this NoC to provide 100 TOPS of processing, which is suitable for Level 3 autonomy, says Imagination. Multiple clusters can be placed on a chip for the even higher performance that will be necessary for Level 4 and 5 autonomous operation; these clusters would be connected by a slower AXI bus (a back-of-envelope check on these figures appears at the end of this article).

“You need to minimise the traffic between clusters, so it’s more of a system design,” said Gilberto Blanco at Imagination Technologies. “When you go to 600 TOPS you have to work with the customer to coordinate all the workloads.

“With Levels 3 and 4 you need to go beyond 100 TOPS, but not doing the same tasks. The heavy-lifting tasks are at 40-60 TOPS with multiple tasks, and there is not a lot of data transferred between clusters,” he said.

Tony King-Smith at AImotive said, “This is another dimension to the choice of distributed versus centralised processing. Because a lot of early autonomous systems have been based on centralised processors, you have all the data available but you pay a price for that – particularly with camera, Lidar and radar sensors all generating more data, so if you are transmitting that at 30 fps that’s a lot of data to move to the central processor.

“Whether you have the separate cores close to the sensor or in a central cluster, it is easier to upgrade them. For example, when you move from a 1 MP to an 8 MP sensor, a lot more processing power is required. If you have separate engines for the front-end processing, even in the central processor, it’s easier to upgrade.

“A hierarchy of NoCs creates a very unwieldy beast. It’s about balancing out how autonomous the processing is on each module. One big lump of 100 TOPS isn’t necessarily the best way to do it; it’s how you break down the problem,” King-Smith said.

The Imagination Technologies NNA has been designed as part of an ISO 26262 automotive safety process, in which one module can be used to monitor another to provide redundancy (a minimal sketch of this pattern also appears at the end of this article).

Andrew Grant at Imagination Technologies said, “We have already licensed one of these cores into an SoC design.”

The key to the performance of the cores is a technique called Tensor Tiling, which reduces bandwidth by up to 90% by splitting input data tensors into multiple tiles for efficient processing. This exploits local data dependencies to keep intermediate data in on-chip memory.

“It’s a tiling algorithm that allows you to group the network layers, looking at the workloads and using the tightly coupled on-chip SRAM to segment the workloads and adjust for the maximum workload,” said Grant.
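Imagination has not published the Tensor Tiling algorithm itself, but the principle Grant describes, grouping network layers and keeping their intermediate data in tightly coupled SRAM, can be illustrated. The minimal NumPy sketch below fuses two 3x3 convolution layers and processes the image tile by tile; the tile size, halo handling and two-layer fusion are illustrative assumptions, not the Series4 implementation.

```python
import numpy as np

def conv3x3_valid(x, k):
    # Naive 3x3 "valid" convolution: output shrinks by 2 in each dimension.
    H, W = x.shape
    out = np.zeros((H - 2, W - 2))
    for i in range(H - 2):
        for j in range(W - 2):
            out[i, j] = np.sum(x[i:i+3, j:j+3] * k)
    return out

def fused_tiled(x, k1, k2, tile=32):
    # Two fused 3x3 valid convolutions: each output pixel depends on a
    # 5x5 input neighbourhood, so every tile is read with a 2-pixel halo.
    halo = 2
    H, W = x.shape
    out = np.zeros((H - 2 * halo, W - 2 * halo))
    for i0 in range(0, out.shape[0], tile):
        for j0 in range(0, out.shape[1], tile):
            i1 = min(i0 + tile, out.shape[0])
            j1 = min(j0 + tile, out.shape[1])
            # One external-memory read: the input tile plus its halo.
            patch = x[i0:i1 + 2 * halo, j0:j1 + 2 * halo]
            # Both layers run back to back on the tile; the intermediate
            # feature map never leaves "on-chip" memory.
            out[i0:i1, j0:j1] = conv3x3_valid(conv3x3_valid(patch, k1), k2)
    return out

# Sanity check against the unfused version, which would write the full
# intermediate feature map to external memory between the two layers.
rng = np.random.default_rng(0)
x = rng.standard_normal((128, 128))
k1, k2 = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
assert np.allclose(fused_tiled(x, k1, k2),
                   conv3x3_valid(conv3x3_valid(x, k1), k2))
```

Each tile costs one external read of the input patch and one write of the final output; the intermediate feature map, which the unfused version writes out in full, stays local. That locality is where a bandwidth saving of the kind Imagination quotes comes from.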
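As a back-of-envelope check on the throughput figures quoted above: if each of the 4096 MAC units performs one multiply and one accumulate (two operations) per clock cycle, a common counting convention rather than a published Imagination figure, the quoted 12.5 TOPS implies a clock of roughly 1.5 GHz, and the cluster figures scale linearly with core count.

```python
# Back-of-envelope check on the quoted Series4 figures.
# Assumption (not a published figure): 1 MAC = 2 ops per clock cycle.
MACS_PER_CORE = 4096
CORE_TOPS = 12.5

implied_clock_ghz = CORE_TOPS * 1e12 / (MACS_PER_CORE * 2) / 1e9
print(f"implied clock: {implied_clock_ghz:.2f} GHz")  # ~1.53 GHz

for cores in (1, 8):  # a single MC1 core, then a full eight-core cluster
    print(f"{cores} core(s): {cores * CORE_TOPS:.1f} TOPS")
# 8 cores x 12.5 TOPS = 100 TOPS per cluster; six clusters linked over
# AXI would reach the 600 TOPS Blanco mentions.
```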
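King-Smith's 1 MP to 8 MP example is easy to quantify. Assuming raw 12-bit pixel output (an illustrative figure; the article does not state a bit depth), the per-camera data rate at 30 fps works out as follows.

```python
# Rough data-rate arithmetic behind King-Smith's sensor-upgrade point.
# Assumption (not from the article): raw 12-bit pixels.
BITS_PER_PIXEL = 12
FPS = 30

for megapixels in (1, 8):
    rate_gbps = megapixels * 1e6 * BITS_PER_PIXEL * FPS / 1e9
    print(f"{megapixels} MP @ {FPS} fps: {rate_gbps:.2f} Gbit/s per camera")
# 1 MP: ~0.36 Gbit/s; 8 MP: ~2.88 Gbit/s - an 8x jump per sensor that a
# centralised processor has to absorb across every camera on the vehicle.
```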
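The ISO 26262 arrangement mentioned above, in which one module monitors another, follows a familiar lockstep-style redundancy pattern. A minimal sketch of that pattern, with hypothetical function names and tolerance, might look like this; it is not Imagination's implementation.

```python
import numpy as np

def checked_inference(primary, monitor, x, tol=0.0):
    """Run the same workload on two modules; flag any divergence.

    `primary` and `monitor` stand in for two NNA cores running the
    same (or a checking) workload - hypothetical names for illustration.
    """
    y_primary = primary(x)
    y_monitor = monitor(x)
    if not np.allclose(y_primary, y_monitor, atol=tol):
        raise RuntimeError("lockstep mismatch: possible hardware fault")
    return y_primary

# Usage sketch: both "cores" compute the same function, so this passes.
y = checked_inference(lambda x: x * 2, lambda x: x * 2, np.ones(4))
```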
Chip designs using the NNA have already begun, with layout and test chips on a 7 nm process and a lead customer due this year.

(Image caption: The Series4 NNA blocks are connected by a network-on-chip, and modules are linked by the standard AXI bus)

RkJQdWJsaXNoZXIy MjI2Mzk4