Link preview
TinyTPU: SystemVerilog systolic array compiled to WASM, running live in browser - RTL golden-verified against numpy [P]
Most explanations of TPUs and systolic arrays are either hand-wavy diagrams or papers. I wanted to see the thing actually run, so I built it. TinyTPU is a 4×4 weight-stationary systolic array in real SystemVerilog, compiled to WebAssembly, with a step-by-step browser visualization. You enter two matrices, hit run, and watch the actual hardware execute: weights loading into PEs, matrix A streaming in diagonally (the "skew" that makes systolic arrays work), partial sums accumulating down the grid, results draining from the bottom. It has three levels: L1 - isolate a single MAC cell, watch one multiply-accumulate happen L2 - the full 4×4 array executing a real matmul L3 - tiling: what happens when your matrix is bigger than the hardware Nothing on screen is faked. The visualization reads state directly from compiled RTL. If you're trying to understand how matrix multiply maps to hardware why TPUs are efficient, what "weight-stationary" actually means, why the diagonal stagger exists this might click it for you in a way papers don't. Repo: tiny-tpu Live demo: Live If this project interests you please do star the repo, if you find something needs improving open a PR, I hope ya'll check this out and give me some feedback 🙏 submitted by /u/Horror-Flamingo-2150 [link] [Kommentare] reddit.com · reddit.com
Most explanations of TPUs and systolic arrays are either hand-wavy diagrams or papers. I wanted to see the thing actually run, so I built it. TinyTPU is a 4×4 weight-stationary systolic array in real SystemVerilog, compiled to WebAssembly, with a step-by-step browser visualization. You enter two matrices, hit run, and watch the actual hardware execute: weights loading into PEs, matrix A streaming in diagonally (the "skew" that makes systolic arrays work), partial sums accumulating down the grid, results draining from the bottom. It has three levels: L1 - isolate a single MAC cell, watch one multiply-accumulate happen L2 - the full 4×4 array executing a real matmul L3 - tiling: what happens when your matrix is bigger than the hardware Nothing on screen is faked. The visualization reads state directly from compiled RTL. If you're trying to understand how matrix multiply maps to hardware why TPUs are efficient, what "weight-stationary" actually means, why the diagonal stagger exists this might click it for you in a way papers don't. Repo: tiny-tpu Live demo: Live If this project interests you please do star the repo, if you find something needs improving open a PR, I hope ya'll check this out and give me some feedback 🙏 submitted by /u/Horror-Flamingo-2150 [link] [Kommentare]
Comments