@Kevin: as in, in general, because the M1 is not x86 and emulating x86 is usually slow? So it would probably actually run about the same if the GPU does all the heavy math, assuming that would all work just fine and there existed emulation software that did this exact thing already… or did you mean if they got it to run on the 16-core dedicated Neural Engine in the M1, which Apple claims is over ten times faster at machine learning and AI stuff? (I'm curious about both, actually, because I'm very tempted to get an M1 just to make my streaming problems go away entirely, and it did seem very promising.)
I use a 2020 Mac mini with an 8th gen i7 + Blackmagic eGPU. I've found some strange timing differences in related tasks: for instance, when live streaming to Twitch I can't get a stable 1080p60 stream out of the graphics card - I start dropping frames when trying to go over 40fps - but only for live streaming, and the GPU is nowhere near maxed out. Live recording to disk, it'll happily work at 60fps with identical settings - no idea why. I've also tried the T2 chip in those Macs, and it actually did quite well at 60fps, without even warming the CPU up all that much - but the output stream had a much higher bitrate, which was just too much for some of my viewers: it'd need about 5-10 Mbps, and the T2 doesn't really care that I want it smaller, so my viewers get lag spikes. Finally, the CPU software encoder doesn't even tax the CPU halfway while producing a stable and sharper 1080p60 signal at 3 Mbps, making the viewers happy.
Somehow I never thought to test that with VEAI - and I found out the other day that I forgot to tell Gigapixel AI to use the GPU, which explained why it felt like much more of a drain on the system, but it also wasn't noticeably faster when I did switch it over?
I'm curious now how the CPU would do on the videos… hmmm… I'll try later with some old VHS rips. ^^
I wonder if the dedicated image/video encoding features of the T2 would be any good for running an AI like that. I can't find details on the exact operations it can do, but since it's a lot of MPEG/MP4 work I assume it'll mostly be Fourier-style transforms (DCTs and the like), which don't sound all that useful… but it was SO FAST and stable, for real!
Also @reiner: sorry for being a pedant: it's not really that the CPU is much slower at the math it would need to do - it's actually faster at those calculations. It's really more the second point you mentioned: a high end GPU like the RTX 3070 has a modest clock speed of around 1.5 GHz with matching RAM - but most pixels in a game image don't really depend on the value of other pixels elsewhere in the same image. And you really only need very basic linear algebra to calculate the color they should have - the most common task is multiplying vectors and matrices. So those cards have thousands of mini ALU clusters that can only multiply and maybe add simple numbers, because that's how you multiply matrices: field by field. But they still calculate large portions of a frame - or even full frames - in a handful of clock cycles, because those pixels can all be calculated independently.
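Just to make the "every pixel is independent" bit concrete, here's a tiny toy sketch in Python (nothing like real shader code, purely illustrative): every pixel runs the same little multiply-and-add recipe and never looks at any other pixel, which is exactly why you can hand each one to its own mini ALU.

```python
# Toy "shader": each pixel's brightness is a small dot product
# (multiply pairs, then add) and depends on no other pixel,
# so all ~2 million of them could run at the same time.

def shade_pixel(normal, light_dir):
    brightness = sum(n * l for n, l in zip(normal, light_dir))  # 3 muls + 2 adds
    return max(0.0, brightness)

light = (0.0, 0.6, 0.8)
frame = [[shade_pixel((0.0, 0.0, 1.0), light)  # same tiny math for every pixel
          for x in range(1920)]
         for y in range(1080)]
```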
Higher end cards may even be clocked slower: the RTX 3090 runs at about 1.4 GHz, and it's being fed by slower-still RAM at about 1.2 GHz. But it has roughly twice as many mini ALU clusters, and there's more RAM spread over more chips, increasing the bandwidth, so it can push way more pixels - or larger, sharper images - in the same time frame.
So coming from there: it turns out that image synthesis and AI share a lot of properties when it comes to the math they need to do. 3D renders will at most deal with 4x4 matrices, which means the most complex single operation is multiplying two 4x4 matrices. But the GPU does not typically even have a concept of what a matrix or vector is for this purpose: it'll split it up into multiplications and adds, like we learned in school. It doesn't even need to be smart about it and analyze the matrices for sparse entries and such, because just doing all of it naively is only 4³ = 64 multiplications and (4-1)·4² = 48 additions. A single one of these mini ALUs can very well do those in parallel in a few cycles - there's some dependence in there, but it turns out it's not just going to do the one matrix. It's a lot of them, so they are very simple to stagger, and where needed it has enough multiply/adders to merge several operations like that into one mini ALU. And it has literally thousands of THOSE in turn. So it doesn't do much math, or even vaguely hard math - and it's slower for any single operation. But it'll be doing thousands of them per cycle at any time.
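Spelled out in Python, the "like we learned in school" version looks like this (just for counting the ops, obviously not how the hardware actually does it):

```python
# Naive 4x4 matrix multiply, field by field: each of the 16 output fields
# needs 4 multiplications and 3 additions -> 64 muls and 48 adds in total.

def matmul4(a, b):
    c = [[0.0] * 4 for _ in range(4)]
    for i in range(4):          # row of a
        for j in range(4):      # column of b
            c[i][j] = sum(a[i][k] * b[k][j] for k in range(4))
    return c

identity = [[float(i == j) for j in range(4)] for i in range(4)]
print(matmul4(identity, identity))  # sanity check: the identity again
```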
The kinds of calculations that you do for an AI are really very similar: for instance, if you have a neural network for a task, you are putting input signals in the form of vectors into very large, multi-dimensional matrices of artificial neurons. There are a few twists those neurons can be programmed with, but the very basic activation function they perform is multiplying every element in a vector of neural weights with the matching input vector element and then summing those pairs up. It'll be way larger vectors, but even at, say, 50 dimensions, a transfer function like that would still only be 50 multiplications and 49 adds - and the extra twist, whatever it is, is more like a handful of ops, so it doesn't matter in context. That's about the same number of operations, and even a very similar split between multiply and add, as the 4x4 matrix multiplication. So the GPU is doing the exact same thing - and typically the hardware doesn't know anything about neurons. For all it knows it's still playing Quake. Image rendering and AI stuff are pretty much the same in terms of math, as far as that GPU is concerned.
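Boiled down to the basics, one of those neurons could look like this (a deliberately simplified sketch - the sigmoid "twist" is just one example, and real frameworks batch millions of these into big matrix multiplies):

```python
import math

# One artificial neuron: multiply every weight with the matching input,
# sum the pairs, then apply a small "twist" (here: a sigmoid).
# At 50 inputs that's 50 multiplications + 49 additions + a handful of
# extra ops - the same mul/add mix as the 4x4 matrix multiply above.

def neuron(inputs, weights, bias=0.0):
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-(total + bias)))

print(neuron([0.2] * 50, [0.1] * 50))  # 50-dimensional toy example
```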
Meanwhile, your typical CPU is often clocked at 3-4 GHz these days - but it needs to do far more diverse things than just multiply and add. Those are still common, but there's also a lot of logic and control flow and shuffling bytes around involved, and it's way less predictable, too, because sometimes it takes a few cycles longer to send bytes to the hard drive, or the human presses keys that it needs to decide what to do with. So it wouldn't make sense to stuff a hundred super dumb multiply/adders into one core; it wouldn't be feasible to use most of them at the same time. Individual cores might be able to do a few arithmetic and logic operations simultaneously while waiting for data they told other parts of the computer to move. And it's way more complex to know whether a sequence of operations can run at the same time, because there are far more ways they can depend on each other. CPUs might have a few dedicated small matrix and vector instructions, but they are not easy to use, and you can't always rely on them being supported or even performing well.
So they won't do thousands of things at any one time in one of the cores, like that mini ALU does all the time. More in the order of fewer than 5, plus a few waits. And the CPU also doesn't have thousands of those cores to spread the work over - it'll be more around 10 or so. That leads to the weird observation that the CPU can do way more things, way more smartly and efficiently and faster - but absolutely not at scale. So rendering a picture will take a very long time compared to a GPU being asked to do the same, because while the CPU core will have no issue with multiply and add, that will leave most of the core dormant: it isn't being fed logic or other ops it could do while the math portion is calculating this single field - or maybe a handful, if the stars and the data layout aligned - while the far simpler mini ALU in the GPU is doing several full matrices, equalling hundreds of those ops the CPU hasn't even read yet… And even if you magically clocked the CPU and RAM at 100x of the GPU so it could keep up core by core - it still only has 10 of them, not thousands.
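If anyone wants to see that gap on their own machine, here's a rough sketch - assuming you have PyTorch installed; "cuda" is for NVIDIA cards, newer Macs would use the "mps" backend instead, and the absolute numbers will vary wildly, it's only meant to show the order of magnitude:

```python
import time
import torch

# One big matrix multiply: billions of independent multiply/adds,
# exactly the workload those thousands of mini ALUs are built for.
x = torch.randn(4096, 4096)

def time_matmul(device):
    a = x.to(device)
    start = time.perf_counter()
    (a @ a).sum().item()  # .item() forces the device to finish the work
    return time.perf_counter() - start

print("cpu:", time_matmul("cpu"))
if torch.cuda.is_available():
    print("gpu:", time_matmul("cuda"))
elif torch.backends.mps.is_available():
    print("gpu:", time_matmul("mps"))
```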
It's basically like playing Protoss and getting Zerg rushed - except you're playing against hundreds of Zerg players at the same time and you only brought one Terran mate to help you out.
eh… whoops, that got long… sorry, linear algebra is actually one of the hobbies I use to chill out. ^^