@Kevin: as in, in general, because the M1 is not x86 and emulating x86 is usually slow? So it would probably actually run about the same if the GPU does all the heavy math, assuming that would all work just fine and there existed emulation software that did this exact thing already… or did you mean if they got it to run on the 16-core dedicated Neural Engine in the M1, which Apple claims is over ten times faster at machine learning and AI stuff? (I'm curious about both, actually, because I'm very tempted to get an M1 just to make my streaming problems go away entirely, and it did seem very promising.)
I use a 2020 Mac mini with an 8th gen i7 + Blackmagic eGPU. I've found some strange timing differences in related tasks: for instance, when live streaming to Twitch I can't get a stable 1080p60 stream out of the graphics card - I start dropping frames when trying to go over 40fps - but only for live streaming, and the GPU is nowhere near maxed out. Live recording to disk, it'll happily work at 60fps with identical settings - no idea why. I've also tried the T2 chip in those Macs, and it actually did quite well at 60fps, without even warming the CPU up all that much - but the output stream had a much higher bitrate, which was just too much for some of my viewers: it'd need about 5-10 Mbps, and the T2 doesn't really care that I want it smaller, so my viewers get lag spikes. Finally, the CPU software encoder doesn't even tax the CPU halfway while producing a stable and sharper 1080p60 signal at 3 Mbps, making the viewers happy.
Somehow I never thought to test that with VEAI - and I found out the other day that I forgot to tell Gigapixel AI to use the GPU, which explained why it felt like much more of a drain on the system, but it also wasn't noticeably faster when I did switch it over?
I'm curious now how the CPU would do on the videos… hmmm… I'll try later with some old VHS rips. ^^
I wonder if the dedicated image/video encoding features of the T2 would be any good for running an AI like that. I can't find details on the exact operations it can do, but since it's a lot of MPEG/MP4 work I assume it'll mostly be Fourier-style transforms (DCTs and the like), which don't sound all that useful… but it was SO FAST and stable, for real!
Also @reiner: sorry for being a pedant: it's not really that the CPU is much slower at the math it would need to do - it's actually faster at those calculations. It's really more the second point you mentioned: a high end GPU like the RTX 3070 has a modest clock speed of around 1.5 GHz with matching RAM - but most pixels in a game image don't really depend on the value of other pixels elsewhere in the same image. And you really only need very basic linear algebra to calculate the color they should have - the most common task is multiplying vectors and matrices. So those cards have thousands of mini ALU clusters that can only multiply and maybe add simple numbers, because that's how you multiply matrices: field by field. But they still calculate large portions of a frame - or even full frames - in a handful of clock cycles, because those pixels can all be calculated independently.
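Just to make the "every pixel is independent" bit concrete, here's a tiny toy sketch in Python (nothing like real shader code, purely illustrative): every pixel runs the same little multiply-and-add recipe and never looks at any other pixel, which is exactly why you can hand each one to its own mini ALU.

```python
# Toy "shader": each pixel's brightness is a small dot product
# (multiply pairs, then add) and depends on no other pixel,
# so all ~2 million of them could run at the same time.

def shade_pixel(normal, light_dir):
    brightness = sum(n * l for n, l in zip(normal, light_dir))  # 3 muls + 2 adds
    return max(0.0, brightness)

light = (0.0, 0.6, 0.8)
frame = [[shade_pixel((0.0, 0.0, 1.0), light)  # same tiny math for every pixel
          for x in range(1920)]
         for y in range(1080)]
```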
Higher end cards may even be clocked slower: the RTX 3090 runs at about 1.4 GHz, and it's being fed by slower-still RAM at about 1.2 GHz. But it has roughly twice as many mini ALU clusters, and there's more RAM spread over more chips, increasing the bandwidth, so it can push way more pixels - or larger, sharper images - in the same time frame.
So coming from there: it turns out that image synthesis and AI share a lot of properties when it comes to the math they need to do. 3D renders will at most deal with 4x4 matrices, which means the most complex single operation is multiplying two 4x4 matrices. But the GPU does not typically even have a concept of what a matrix or vector is for this purpose: it'll split it up into multiplications and adds, like we learned in school. It doesn't even need to be smart about it and analyze the matrices for sparse entries and such, because just doing all of it naively is only 4³ = 64 multiplications and (4-1)·4² = 48 additions. A single one of these mini ALUs can very well do those in parallel in a few cycles - there's some dependence in there, but it turns out it's not just going to do the one matrix. It's a lot of them, so they are very simple to stagger, and where needed it has enough multiply/adders to merge several operations like that into one mini ALU. And it has literally thousands of THOSE in turn. So it doesn't do much math, or even vaguely hard math - and it's slower for any single operation. But it'll be doing thousands of them per cycle at any time.
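Spelled out in Python, the "like we learned in school" version looks like this (just for counting the ops, obviously not how the hardware actually does it):

```python
# Naive 4x4 matrix multiply, field by field: each of the 16 output fields
# needs 4 multiplications and 3 additions -> 64 muls and 48 adds in total.

def matmul4(a, b):
    c = [[0.0] * 4 for _ in range(4)]
    for i in range(4):          # row of a
        for j in range(4):      # column of b
            c[i][j] = sum(a[i][k] * b[k][j] for k in range(4))
    return c

identity = [[float(i == j) for j in range(4)] for i in range(4)]
print(matmul4(identity, identity))  # sanity check: the identity again
```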
The kinds of calculations that you do for an AI are really very similar: for instance, if you have a neural network for a task, you are putting input signals in the form of vectors into very large, multi-dimensional matrices of artificial neurons. There are a few twists those neurons can be programmed with, but the very basic activation function they perform is multiplying every element in a vector of neural weights with the matching input vector element and then summing those pairs up. It'll be way larger vectors, but even at, say, 50 dimensions, a transfer function like that would still only be 50 multiplications and 49 adds - and the extra twist, whatever it is, is more like a handful of ops, so it doesn't matter in context. That's about the same number of operations, and even a very similar split between multiply and add, as the 4x4 matrix multiplication. So the GPU is doing the exact same thing - and typically the hardware doesn't know anything about neurons. For all it knows it's still playing Quake. Image rendering and AI stuff are pretty much the same in terms of math, as far as that GPU is concerned.
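Boiled down to the basics, one of those neurons could look like this (a deliberately simplified sketch - the sigmoid "twist" is just one example, and real frameworks batch millions of these into big matrix multiplies):

```python
import math

# One artificial neuron: multiply every weight with the matching input,
# sum the pairs, then apply a small "twist" (here: a sigmoid).
# At 50 inputs that's 50 multiplications + 49 additions + a handful of
# extra ops - the same mul/add mix as the 4x4 matrix multiply above.

def neuron(inputs, weights, bias=0.0):
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-(total + bias)))

print(neuron([0.2] * 50, [0.1] * 50))  # 50-dimensional toy example
```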
Meanwhile, your typical CPU is often clocked at 3-4 GHz these days - but it needs to do far more diverse things than just multiply and add. Those are still common, but there's also a lot of logic and control flow and shuffling bytes around involved, and it's way less predictable, too, because sometimes it takes a few cycles longer to send bytes to the hard drive, or the human presses keys that it needs to decide what to do with. So it wouldn't make sense to stuff a hundred super dumb multiply/adders into one core; it wouldn't be feasible to use most of them at the same time. Individual cores might be able to do a few arithmetic and logic operations simultaneously while waiting for data they told other parts of the computer to move. And it's way more complex to know whether a sequence of operations can run at the same time, because there are far more ways they can depend on each other. CPUs might have a few dedicated small matrix and vector instructions, but they are not easy to use, and you can't always rely on them being supported or even performing well.
So they won't do thousands of things at any one time in one of the cores, like that mini ALU does all the time. More in the order of fewer than 5, plus a few waits. And the CPU also doesn't have thousands of those cores to spread the work over - it'll be more around 10 or so. That leads to the weird observation that the CPU can do way more things, way more smartly and efficiently and faster - but absolutely not at scale. So rendering a picture will take a very long time compared to a GPU being asked to do the same, because while the CPU core will have no issue with multiply and add, that will leave most of the core dormant: it isn't being fed logic or other ops it could do while the math portion is calculating this single field - or maybe a handful, if the stars and the data layout aligned - while the far simpler mini ALU in the GPU is doing several full matrices, equalling hundreds of those ops the CPU hasn't even read yet… And even if you magically clocked the CPU and RAM at 100x of the GPU so it could keep up core by core - it still only has 10 of them, not thousands.
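If anyone wants to see that gap on their own machine, here's a rough sketch - assuming you have PyTorch installed; "cuda" is for NVIDIA cards, newer Macs would use the "mps" backend instead, and the absolute numbers will vary wildly, it's only meant to show the order of magnitude:

```python
import time
import torch

# One big matrix multiply: billions of independent multiply/adds,
# exactly the workload those thousands of mini ALUs are built for.
x = torch.randn(4096, 4096)

def time_matmul(device):
    a = x.to(device)
    start = time.perf_counter()
    (a @ a).sum().item()  # .item() forces the device to finish the work
    return time.perf_counter() - start

print("cpu:", time_matmul("cpu"))
if torch.cuda.is_available():
    print("gpu:", time_matmul("cuda"))
elif torch.backends.mps.is_available():
    print("gpu:", time_matmul("mps"))
```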
It's basically like playing Protoss and getting Zerg rushed - except you're playing against hundreds of Zerg players at the same time and you only brought one Terran mate to help you out.
eh… whoops, that got long… sorry, linear algebra is actually one of the hobbies I use to chill out. ^^