Abstract

Onyx is a system-on-chip (SoC) with a coarse-grained reconfigurable array (CGRA) for accelerating sparse and dense tensor algebra as well as dense image processing and machine learning (ML) applications. To support multiple inputs, multiple dimensions, and fusion in sparse applications, Onyx uses composable memory primitives that operate on compressed storage and streams, along with compute primitives that eliminate unnecessary calculations. Onyx also improves performance on dense applications with application-specialized processing elements (PEs), area-optimized memory tiles, and hybrid clock gating in the global buffer (GLB). Onyx achieves a peak energy efficiency of 756 INT16 GOPS/W, up to better energy-delay product (EDP) for sparse kernels versus CPUs with sparse libraries, and up to 76% and 85% lower EDP for image processing and ML, respectively, versus the state-of-the-art CGRA.

BibTeX

@article{koul2025,
  title={Onyx: A 12-nm Programmable Accelerator for Dense and Sparse Applications},
  author={Kalhan Koul and Olivia Hsu and Yuchen Mei and Sai Gautham Ravipati and Maxwell Strange and Jackson Melchert and Alex Carsello and Taeyoung Kong and Po-Han Chen and Huifeng Ke and Keyi Zhang and Qiaoyi Liu and Gedeon Nyengele and Zhouhua Xie and Akhilesh Balasingam and Jayashree Adivarahan and Ritvik Sharma and Christopher Torng and Joel Emer and Fredrik Kjolstad and Mark Horowitz and Priyanka Raina},
  journal={IEEE Journal of Solid-State Circuits},
  year={2025},
  month=sep
}