Completely wrong. At the moment the consumer market is split between NVIDIA's AI-oriented GPUs and the hardware accelerators from other companies, but that may well look different in two years.
So the choice right now is obvious and depends on your requirements.
And this isn't specific to Topaz products. Many companies are in the same race now, so whoever can exploit the most powerful hardware available will be the winner for the near term (until the hardware market changes).
No, you're wrong. Just because a card from manufacturer A is somewhat slower than the top model from B doesn't mean it can't be a valid choice at all.
The performance/cost ratio is also important, and for some, performance per watt as well (a quick sketch of both metrics follows below).
Plus, and this is IMO a really important point: if manufacturer B has the fastest card at the moment but has a history of driver issues leading to crashes and hardware defects (RTX 4090 series, anyone?), and the GPU of manufacturer A does the same job just a bit slower but reliably, it might even be the preferable choice.
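To make that concrete, here's a trivial sketch with made-up placeholder numbers; plug in your own benchmark fps, street price, and measured board power:

```python
# Toy value comparison: raw fps isn't the only metric that matters.
# All numbers below are made-up placeholders, not real benchmark results.
cards = {
    "Card A": {"fps": 90.0, "price_usd": 650, "watts": 220},
    "Card B": {"fps": 120.0, "price_usd": 1200, "watts": 320},
}

for name, c in cards.items():
    print(f"{name}: {c['fps'] / c['price_usd']:.3f} fps/$, "
          f"{c['fps'] / c['watts']:.2f} fps/W")
# Card A wins on fps/$ (0.138 vs 0.100) despite being 25% slower outright.
```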
I really like the lively discussion here. It shows that my decision isn't easy once money comes into the picture.
Maybe owners of an RTX 4080 (or 4090) and an RX 7900 XTX could post built-in benchmark results for resolutions below HD?
But that is exactly what I said: you should select hardware according to your requirements and workflow. If you are a gamer who only occasionally runs AI processing, cheaper but weaker cards are the right choice. If you are involved in heavy AI processing, you need a proper AI-capable GPU, even if it comes from the consumer segment.
I think there are multiple benchmark threads:
Pay attention to the other hardware noted (CPU, RAM, and so on).
Again: if your "AI-processor" often fails encodes or even crashes the whole system due to driver issues, the nominally slower card can effectively be faster because it actually finishes all its encodes.
Simple answer.
The 7900 XTX is not a good comparison here; the others run circles around it.
That's it.
If the export takes 8 hours, a GPU that needs 16 hours will also do well.
Don't assume that a lot of reports online means something is a common problem.
I have a 4090, two 4070 Tis, and a 3080, and I've never had a problem with any of them, other than needing to upgrade the PSU for the 3080 due to the well-known problem of sudden power spikes above the PSU's limit.
I fitted a 1000 W be quiet! Straight Power 11 Platinum-class PSU. I don't think the issue is power related, because the black screen also appears at idle.
Alright, I got an RTX 4080 for a good price.
It's funny you say that; I ran into the same issue. I was using an Epson photo scanner while listening to YouTube when my screen went black, the GPU fans spun up to 100%, and the audio kept playing. I turned the computer off and left it unplugged for 10 minutes, then came back to turn it on and the GPU was dead: fans spinning as normal but no signal output or anything. It was only a 9-month-old 3080 Ti. I had a 980 Ti for 4 years and put it through extreme tests and long-term stress, and not once did it fail or have issues. It seems older tech is far more durable and reliable, even though those cards were a fraction of the price of current new GPUs.
I don't think it's so much that old tech is more reliable; it's just that a lot of factors during the 30x0 series launch led to lower-quality support components being used on some cards. There's also the fact that RTX cards pretty much auto-overclock, whereas older cards weren't pushed as hard at factory defaults; you had to overclock manually.
Of course we won't know for sure how widespread the failures are until 4 years after the cards came out. More people reporting the problem online doesn't necessarily mean it's more common, just that it's become more common for people to actually talk about it online.
Also, the rise of AI inherently means people are pushing their cards harder. They might game during the day and run an AI job when they're not gaming, so a higher percentage of cards are getting stressed compared to before.
I've had an RTX 2060 and an RTX 3060 Ti both black-screen and crash the PC when using Topaz, often at random, but not with other AI video upscalers, so there is something there that is an issue, and it is not a PSU problem; I have access to a number of computers, have done clean installs, etc.
Having looked closely at the output, the video from the Nvidia RTX is better than from the AMD RX 7600 I have: the Nvidia video has more detail rendered, and its HEVC encoder is more efficient with the same sources when comparing the results (a rough way to quantify that is sketched below).
So Nvidia is better and faster, but it's a huge pain due to the random crashes Topaz has on RTX hardware.
I'm running Windows 10, if that is somehow part of the cause; someone above said Windows 11 somehow leads to fewer crashes, so maybe I'll try that, though a Linux version would be much more desirable.
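If you want to put numbers on the encoder comparison rather than eyeballing it, a common approach is to encode the same source with both HEVC encoders at comparable settings and compare bitrate at matched quality (e.g. VMAF). A minimal sketch, assuming an ffmpeg build with libvmaf on the PATH; the file names are placeholders and the JSON layout depends on the libvmaf version:

```python
import json
import subprocess

REF = "source.mp4"                               # placeholder: original clip
CANDIDATES = ["nvenc_hevc.mp4", "amf_hevc.mp4"]  # placeholder encodes

def vmaf_mean(distorted: str, reference: str) -> float:
    """Score one encode against the reference with ffmpeg's libvmaf filter."""
    log = f"{distorted}.vmaf.json"
    subprocess.run(
        ["ffmpeg", "-y", "-i", distorted, "-i", reference,
         "-lavfi", f"libvmaf=log_path={log}:log_fmt=json",
         "-f", "null", "-"],
        check=True)
    with open(log) as fh:
        return json.load(fh)["pooled_metrics"]["vmaf"]["mean"]

for enc in CANDIDATES:
    print(enc, round(vmaf_mean(enc, REF), 2))
# The more "efficient" encoder is the one that hits the same VMAF at a lower bitrate.
```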
Looks good so far.
Here you go
Topaz Video AI v3.5.4
System Information
OS: Windows v10.22
CPU: AMD Ryzen 9 7950X 16-Core Processor 127.15 GB
GPU: NVIDIA GeForce RTX 4090 23.59 GB
Processing Settings
device: 0 vram: 1 instances: 0
Input Resolution: 640x480
Benchmark Results
Artemis 1X: 174.99 fps 2X: 70.76 fps 4X: 18.10 fps
Iris 1X: 209.16 fps 2X: 93.39 fps 4X: 26.43 fps
Proteus 1X: 134.76 fps 2X: 61.73 fps 4X: 17.71 fps
Gaia 1X: 87.91 fps 2X: 65.17 fps 4X: 25.29 fps
Nyx 1X: 76.73 fps
4X Slowmo Apollo: 208.59 fps APFast: 413.07 fps Chronos: 170.18 fps CHFast: 209.77 fps
Notes:
- I'm using 4 RAM slots in this machine, which results in a much lower RAM speed than 2-DIMM configurations (the current DDR5 debacle). I have Hynix DDR5-6600, but due to the 4-DIMM memory configuration it only runs at about half the rated speed (1.8 GHz instead of 3.3 GHz).
- I've under-clocked the CPU to 4.8 GHz to halve its power consumption, which means it performs about 2-5% worse than at stock clocks.
- TVAI isn't very good at utilizing the GPU for upscaling. Doing a 4x upscale with iris-v2 using tensor cores only consumes 31% GPU but a massive 40% CPU. As such, a 3080 or 4080 should perform just as well, since the work seems to be CPU-bound. A decent graphics card and a very good processor is what you want to pair TVAI with.
Still, I hope this gives you an indication of what to expect from an RTX 4090 and a CPU that is in the ballpark of being sufficiently specced for TVAI.
TL;DR tip: buy the fastest CPU you can, and save a ton of money by getting a lower-end graphics card (a 4080 or 3080 should be enough, even overkill as things stand today).
Topaz Video AI v3.5.2
System Information
OS: Windows v11.22
CPU: 13th Gen Intel(R) Core(TM) i7-13700K 31.773 GB
GPU: NVIDIA GeForce RTX 4070 11.744 GB
GPU: Intel(R) UHD Graphics 770 0.125 GB
Processing Settings
device: -2 vram: 1 instances: 1
Input Resolution: 640x480
Benchmark Results
Artemis 1X: 109.69 fps 2X: 74.87 fps 4X: 23.22 fps
Iris 1X: 108.59 fps 2X: 66.22 fps 4X: 22.06 fps
Proteus 1X: 105.16 fps 2X: 67.43 fps 4X: 23.10 fps
Gaia 1X: 39.38 fps 2X: 26.85 fps 4X: 19.34 fps
Nyx 1X: 36.70 fps
4X Slowmo Apollo: 108.19 fps APFast: 252.39 fps Chronos: 77.56 fps CHFast: 108.73 fps
Very interesting, Jacky, especially the Artemis and Proteus results.
The 4090 has 2.1x the CUDA cores of the 4070 and about 2.7x the tensor cores. Still, the 4090 results are worse than the 4070's for those two models, especially at 4X.
Given that the i7-13700K is pretty close to the 7950X (~10-15%) in non-massively-threaded workloads, and that I've under-clocked my CPU, which closes the CPU performance gap even further, that leaves two main variables of interest: 1) the different TVAI versions used, and 2) the memory configurations and consequently the data bandwidth.
I suspect that if I pulled two of the DIMMs, allowing the RAM to run at double the speed and consequently double the bandwidth, the results would change. By how much is impossible to say.
I'm tempted to do that, but having spent weeks getting the machine stable with 4 DIMMs, I'm not going to. It would be useful if someone else running a 2-DIMM config on a 7950X with a 4090 could post their results.
Edit: it would be super useful if @topazlabs could add reporting of CPU and GPU utilization numbers as part of the benchmark. Those numbers would tell us whether the system is bottlenecked on anything other than the CPU or GPU, and by how much.
I did report my CPU and GPU utilization for exactly that reason (a quick way to log it yourself is sketched below), as the low GPU and CPU usage clearly indicated something was holding the hardware back. Theoretically it would be one of three reasons:
- A) Inefficient locking in the TVAI code, preventing data from flowing between CPU ↔ GPU and RAM ↔ CPU at maximum speed.
- B) Bottlenecks due to memory latencies (e.g. inefficient data structure layouts in memory, resulting in a lot of cache misses and amplified memory I/O).
- C) A memory bandwidth bottleneck between CPU ↔ RAM and RAM ↔ VRAM.
A and B are problems the Topaz Labs engineers could work on solving or improving; they're in their hands. C is nothing they can do anything about; it's up to the respective customer to fix.
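For anyone who wants to collect those utilization numbers alongside a benchmark run today, here is a minimal sketch. It assumes an NVIDIA card with nvidia-smi on the PATH and the third-party psutil package; it samples once per second while the export runs, and you inspect the CSV afterwards:

```python
import csv
import subprocess
import time

import psutil  # third-party: pip install psutil

def gpu_utilization():
    """Query overall GPU load and memory use via nvidia-smi (NVIDIA only)."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"],
        text=True)
    gpu_pct, mem_mib = out.strip().splitlines()[0].split(", ")
    return float(gpu_pct), float(mem_mib)

with open("utilization_log.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["t_s", "cpu_pct", "gpu_pct", "gpu_mem_mib"])
    start = time.time()
    while True:  # stop with Ctrl+C once the export is finished
        cpu = psutil.cpu_percent(interval=1.0)  # averaged over all cores
        gpu, mem = gpu_utilization()
        writer.writerow([round(time.time() - start, 1), cpu, gpu, mem])
        fh.flush()
```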
UPDATE: I found an answer to the question myself, by managing to boost the memory speed from the paltry 1.8 GHz to 3 GHz (closer to the 3.3 GHz the DIMMs are rated for).
As suspected, the impact of memory bandwidth on VAI workloads is significant. So we now know that VAI must be paired with fast RAM, or any fancy CPU and GPU pairing is simply wasted (a back-of-envelope bandwidth comparison follows below).
The speed improvement at 4X was a staggering 40%, and around 30% at 2X. The slow-motion models didn't seem to be impacted much, if at all; they clearly aren't memory-bound.
More details here: DDR5 3600 vs 6000 comparison table
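For a rough sense of what that memory speed bump buys in theory (assuming the usual 64-bit DDR5 channel width and a dual-channel setup; real-world throughput will be lower, but the ratio is the point):

```python
# Back-of-envelope theoretical bandwidth for dual-channel DDR5.
def ddr5_bandwidth_gbps(mt_per_s: int, channels: int = 2) -> float:
    return mt_per_s * 8 * channels / 1000  # 8 bytes per transfer per channel

for speed in (3600, 6000, 6600):
    print(f"DDR5-{speed}: ~{ddr5_bandwidth_gbps(speed):.1f} GB/s")
# DDR5-3600: ~57.6 GB/s, DDR5-6000: ~96.0 GB/s -> roughly 1.67x the raw bandwidth
```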
Topaz Video AI v3.5.4
System Information
OS: Windows v10.22
CPU: AMD Ryzen 9 7950X 16-Core Processor 127.74 GB
GPU: NVIDIA GeForce RTX 4090 23.59 GB
Processing Settings
device: 0 vram: 1 instances: 0
Input Resolution: 640x480
Benchmark Results
Artemis 1X: 201.17 fps 2X: 92.99 fps 4X: 26.39 fps
Iris 1X: 204.39 fps 2X: 128.32 fps 4X: 35.63 fps
Proteus 1X: 177.40 fps 2X: 84.51 fps 4X: 24.47 fps
Gaia 1X: 88.49 fps 2X: 64.19 fps 4X: 35.55 fps
Nyx 1X: 79.37 fps
4X Slowmo Apollo: 215.43 fps APFast: 463.44 fps Chronos: 172.43 fps CHFast: 216.31 fps
RAM: SK Hynix A-die DDR5 running at DDR-6000 (3 GHz)
Another observation: VAI isn't a very threaded application, since only 5 cores were used, and of those five only a single core was used extensively (a quick per-core sample, sketched below, makes this easy to check). This indicates that an 8-core or even a 4-core (8-thread) CPU should be more than enough, and that some of the money can be diverted to good RAM instead. No need for a 16-core CPU; that would just be a waste of money.
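If anyone wants to reproduce that per-core observation on their own machine, here's a minimal sketch (again using the third-party psutil package), to be run while a TVAI export is in progress:

```python
import psutil  # third-party: pip install psutil

# Sample per-core load once per second for a minute while an export runs.
samples = [psutil.cpu_percent(interval=1.0, percpu=True) for _ in range(60)]

# Average load per logical core over the sampling window.
avg = [sum(core) / len(samples) for core in zip(*samples)]
busy = sum(1 for pct in avg if pct > 25)  # arbitrary "in use" threshold
print(f"{busy} of {len(avg)} logical cores averaged above 25% load")
```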
Not really; this doesn't simply indicate bad code.
On the M1 Pro and the M2 Ultra the GPU is at 100% use in both cases according to sysinfo tools. Still, the M2 Ultra with 60 GPU cores is only about 2 times faster than the M1 Pro with its 16 GPU cores. So, despite the alleged full GPU use, there's no real scaling.
Another example: Iris used to reach nearly 30 fps here with 100% GPU use. Now, after a (quite catastrophic) model update, it's only 16-18 fps for the same task, but still at 100% GPU use.
That’s different. You’re talking about code performance at GPU capacity. What I was focusing on was when the system isn’t able to run at capacity in terms of compute. Completely different problems.
Edit: Concerning the different performance of different model versions, that's something I've run into quite a lot with my own models when converting to ONNX. Often it's due to some change in the (model) network architecture that makes the ONNX optimizer select sub-optimal tensor operations for some given representation in TensorFlow / PyTorch. I have no idea how the tensorflow → tensor-core or tensorflow → CoreML compilation works, but it likely works similarly to ONNX in that those frameworks have compilers that use heuristics to translate TensorFlow operation representations into other operations on the target inference architecture. The smallest tweak to the model architecture can throw the compilation step off and cause the compiler to produce a sub-optimal operation (or sequence). The result is that the compute units still spend all their time doing work, but inefficiently, because they take a longer path (more compute cycles).
The analogy is picking the wrong path at a crossroads: one that is longer and climbs uphill instead of one that is straight, even though both lead to the same next fork in the road.
Topaz Labs has tools at its disposal for looking at and debugging the produced execution graphs that represent the networks they upload to their CDNs, so for them this isn't a black box. It is, however, one more step in the QA pipeline, and they may not always do that verification (in fact, we know they don't, given the number of bugs reintroduced between even minor releases whenever they ship updates).
Oh, and there could be an even simpler explanation for your observation: the Topaz folks could simply be changing the network architecture, for example adjusting the depth and width of the layers. A simple way to get an indication of that is to just look at the model size. If they expand the network, it has more parameters, which translates directly into larger "tz" model files, since network weights hardly compress at all (very high entropy). That lets us use the size of the tz files as a proxy for how complex (large) the models are.
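For anyone curious what that kind of inspection looks like in practice: TVAI's ".tz" containers aren't documented as far as I know, so the sketch below works on a generic exported .onnx file ("model.onnx" is a placeholder). It lists which operator types the graph uses, which is where a swapped-in sub-optimal op sequence would show up, and counts the stored weights as a complexity proxy, analogous to watching the file size:

```python
import os
from collections import Counter

import onnx  # third-party: pip install onnx
from onnx import numpy_helper

MODEL_PATH = "model.onnx"  # placeholder path, not an actual TVAI file
model = onnx.load(MODEL_PATH)

# Operator mix of the compiled graph: a model update that replaces fused
# ops with many small ones changes these counts.
op_counts = Counter(node.op_type for node in model.graph.node)
print(op_counts.most_common(10))

# Parameter count from the stored weights: a rough size/complexity proxy.
n_params = sum(numpy_helper.to_array(init).size
               for init in model.graph.initializer)
print(f"{n_params / 1e6:.1f} M parameters, "
      f"{os.path.getsize(MODEL_PATH) / 1e6:.1f} MB on disk")
```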