| ITT: I demystify the Apple Neural Engine, and provide proof. If you've spent any time around Apple Silicon ML discussions, you've probably seen the "Neural Engine" referenced as this discrete, mysterious coprocessor sitting on the die — a black box that CoreML talks to, separate from the CPU and GPU. Apple markets it that way. "16-core Neural Engine. 38 TOPS." It's on every spec sheet. Here's the thing: it's not that simple, and some of the assumptions floating around are just wrong. What I built: A bare-metal ARM SME2 bytecode interpreter — custom opcodes, hand-written ARM64 assembly — that drives the M4 Pro Max (or M5) matrix tiles directly. No CoreML. No BNNS. No frameworks. Just raw instructions on the CPU's Note: there is a reason for the interpreter approach: these operations require the core to be in streaming mode, I assume to streamline memory load and store operations for z-tile computation efficiency (have to keep the unit fed). You can't inline the smstart or smstop instructions, so by using a simple bytecode interpreter several instructions can be chained together in the same stream session without having to write a new assembly kernel for everything you're trying to do with the matrix unit. The results? Performance characteristics that are identical to what Apple markets as the Neural Engine. Same throughput ceilings. Same restrictions (prefers int8, no FP8 support, same bf16/fp32 types). Same documentation (none). I ran a contention benchmark on M4 Max — GPU (Metal INT8), CPU SME ( Still, I know Apple's marketing mythos is powerful (I still have to convince Claude that the M4 has an SME unit from time to time). For people who still want to believe these are two independent units, I invite you to imagine the following scene:
What Apple calls the "Neural Engine" — at least on M4 — appears to be the Scalable Matrix Extension (SME2) built into the CPU cores, accessed through a software stack (CoreML/ANE driver) that abstracts it away. It's genuinely impressive hardware. Apple's marketing department deserves credit for making it sound even more impressive by giving it its own name and its own TOPS line item. But it's not a discrete coprocessor in the way most people assume. Once you understand that, you can skip CoreML entirely and talk to the hardware directly. Repo: https://github.com/joshmorgan1000/ane Includes an all-in-one SME instruction probe script. [link] [comments] |