Abstract

Onyx is the first fully programmable accelerator for arbitrary sparse tensor algebra kernels. Unlike prior work, it supports higher-order tensors, multiple inputs, and fusion. It achieves this with a coarse-grained reconfigurable array (CGRA) that has composable memory primitives for storing compressed any-order tensors and compute primitives that eliminate ineffectual computations in sparse expressions. Further, Onyx improves dense image processing and machine learning (ML) with application-specialized compute tiles, memory tiles optimized for affine access patterns, and hybrid clock gating in the global buffer. We achieve up to 565x better energy-delay product (EDP) for sparse kernels vs. CPUs with sparse libraries, and up to 76% and 85% lower EDP for image processing and ML, respectively, vs. Amber [1].

BibTeX

@inproceedings{koul2024,
  title={Onyx: A 12nm 756 GOPS/W Coarse-Grained Reconfigurable Array for Accelerating Dense and Sparse Applications},
  author={Kalhan Koul and Maxwell Strange and Jackson Melchert and Alex Carsello and Yuchen Mei and Olivia Hsu and Taeyoung Kong and Po-Han Chen and Huifeng Ke and Keyi Zhang and Qiaoyi Liu and Gedeon Nyengele and Akhilesh Balasingam and Jayashree Adivarahan and Ritvik Sharma and Zhouhua Xie and Christopher Torng and Joel Emer and Fredrik Kjolstad and Mark Horowitz and Priyanka Raina},
  booktitle={IEEE Symposium on VLSI Technology and Circuits (VLSI)},
  year={2024},
  month=jun
}