Performance Discussion

Thanx for getting back.

It seems I did a bad job of getting my point across :slight_smile:

GCN1 and 2 are indeed very slow in FP16. GCN3 and 4 introduced FP16 at the same speed as FP32. Fiji cards were even explicitly advertised by AMD for FP16 workloads (they ran object-recognition inference on them in the keynote developer demo)…

I explicitly checked FP16 performance on GCN 3 and 4 in several scenarios. clpeak measures FP16/32/64 scalar and vector throughput and shows exactly the same performance for full and half precision. Tweaking the JSON files in TVAI to force FP16 usage also shows the same or even slightly higher performance with the FP16 models loaded, compared to the FP32 ones. Last year's benchmark tool from you shows this as well, and Vulkan compute tests with different inferences (Waifu2x, ESRGAN, etc.) also prove that it works. I am at the office right now, so I don't have access to the screenshots of the results - one test in a recent TVAI beta is shown in my post. All the numbers seen there come from running the FP16 models.

Since VRAM usage is halved, bigger tile sizes can be used, which - together with the shorter transfer time for the smaller models - improves FP16 performance further. While raw compute is no faster in FP16 on GCN3 and 4, the bigger tile sizes and faster transfers still help overall speed.
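To illustrate what I mean by "the same or even slightly higher performance", here is a minimal sketch of how one can time an FP16 model against its FP32 counterpart with ONNX Runtime on DirectML - the model file names, the tile size, and the NCHW input layout are placeholders, not the actual TVAI files:

```python
# Rough timing sketch: compare FP32 vs FP16 ONNX models on DirectML.
# "model_fp32.onnx" / "model_fp16.onnx", the tile size and the NCHW layout
# are placeholders, not the actual TVAI model files.
import time
import numpy as np
import onnxruntime as ort

def tiles_per_second(model_path, dtype, runs=20):
    sess = ort.InferenceSession(model_path, providers=["DmlExecutionProvider"])
    inp = sess.get_inputs()[0]
    data = np.random.rand(1, 3, 512, 512).astype(dtype)  # one dummy 512x512 tile
    sess.run(None, {inp.name: data})                      # warm-up run
    start = time.perf_counter()
    for _ in range(runs):
        sess.run(None, {inp.name: data})
    return runs / (time.perf_counter() - start)

print("FP32:", tiles_per_second("model_fp32.onnx", np.float32), "tiles/s")
print("FP16:", tiles_per_second("model_fp16.onnx", np.float16), "tiles/s")
```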

I tested this intensively on several GCN 3 cards (Tonga, Fiji) and multiple GCN 4 cards (RX 470, 550, etc.), and it worked in every scenario. When I do single-image scaling with open-source image/video software that uses Vulkan/ONNX/DirectML, I always switch to the FP16 models.

I am completely aware that the focus nowadays is not on old hardware and that any matrix/vector unit like a tensor core gives much higher performance.

But since simply loading FP16 models instead of FP32 - either via a manual switch or by automatic detection of GCN3/4 - should be very little work to implement, and given that there are still enormous numbers of users with these cards out there, I thought it was worth recommending.

The last piece of motivation for raising this topic again came from the fact that the achievable speeds are actually not bad at all - much better than one would anticipate from typical gaming benchmarks. Your ONNX optimizations work wonderfully, and after all, a Fiji GCN3 card can nowadays be bought for around 50 to 80 bucks and delivers almost 9 TFLOPS of FP16 performance with very high bandwidth from its 4096-bit HBM memory (~500 GB/s)… This clearly outperforms most Pascal cards, and on my test bench, using two Fiji cards actually gives better-than-realtime speeds for clean 1:1 passes on DVD material. A year ago we dreamed of speeds like this even on RTX cards.

Of course I would not recommend that anyone doing a lot of TVAI work favour a card like this over a midrange RTX - but many people are still using these cards, they still make decent FHD gaming cards, and fed with FP16 models, even FHD inputs are usable… Something like a Radeon Pro Duo with 2 x 8 or 16 GB then actually becomes usable and gets close to the speed of a small RTX card in some scenarios.

If the GPU selection were extended to "take cards 2 and 3 for computation", that would also help a lot in this scenario… Windows and background processes tend to waste VRAM randomly, so setting up a system with an iGPU (which essentially every Intel CPU of the last ten years has anyway) to handle the desktop display would leave two dedicated "compute cards" free for processing… At the moment we can only select either one card or all cards… If "all cards" is set, one of them always has to deal with the rest of the system, leading to performance losses, occasional VRAM shortages, etc…

I imagine that implementing a “select card to use” switch/selector could be done easily.
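For what it's worth, at the ONNX Runtime / DirectML level, picking a specific adapter per session is just a provider option. A minimal sketch of what a "cards 2 and 3 only" setup could look like - the device indices and the model file name are hypothetical and depend on how the adapters are enumerated on a given system:

```python
# Sketch: pin two inference sessions to two specific DirectML adapters,
# leaving adapter 0 (e.g. the iGPU driving the desktop) untouched.
# The device_id values and the model file name are hypothetical.
import onnxruntime as ort

def session_on_adapter(model_path, device_id):
    return ort.InferenceSession(
        model_path,
        providers=[("DmlExecutionProvider", {"device_id": device_id})],
    )

compute_card_1 = session_on_adapter("model_fp16.onnx", 1)
compute_card_2 = session_on_adapter("model_fp16.onnx", 2)
# Tiles/frames could then be round-robined between the two sessions.
```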

Thanx for taking the time to read all this :slight_smile:


One additional thought I forgot:

Using bigger tile sizes is more performant in most cases. In a scenario where VRAM size is the bottleneck, TVAI chooses a smaller tile size to avoid exceeding VRAM…

If I manually force FP16 models, the internal tile-size decision logic still seems to calculate the tile size based on the VRAM demand of the FP32 model.

If I manually delete all model files except the one FP16 variant with bigger tile sizes and cut the network connection, TVAI then uses a tile size it usually thinks is too big for the given VRAM - but because the FP16 model is half the size, it actually fits into VRAM, and this also gives a slight speed boost.

So manual tweaking only takes one so far - the last bit of possible optimization comes with a big penalty in manual work…

If TVAI internally just decided to use FP16 in these scenarios, the tile size (according to the VRAM present) would automatically be calculated correctly…
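To make the tile-size point concrete, here is a back-of-the-envelope sketch of how precision should enter the calculation. The memory model is deliberately oversimplified (a single hypothetical overhead factor, activations only) and is not how TVAI actually computes it - it just shows that halving the bytes per element allows roughly 1.4x larger tile edges for the same VRAM budget:

```python
# Back-of-the-envelope tile-size estimate. The overhead factor is a made-up
# placeholder; the point is only that bytes_per_elem must be part of the
# formula, so FP16 (2 bytes) allows ~sqrt(2) larger tile edges than FP32.
import math

def max_tile_edge(vram_free_bytes, channels=3, bytes_per_elem=4, overhead=8.0):
    # Assume peak usage ~ overhead * tile_pixels * channels * bytes_per_elem.
    pixels = vram_free_bytes / (overhead * channels * bytes_per_elem)
    return int(math.sqrt(pixels))

vram = 6 * 1024**3  # e.g. 6 GB of free VRAM
print("FP32 tile edge:", max_tile_edge(vram, bytes_per_elem=4))
print("FP16 tile edge:", max_tile_edge(vram, bytes_per_elem=2))  # ~1.41x larger
```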

I think I was not as clear as I intended.

I have been giving feedback for years, and it is usually critical, pointing out things to improve. What I was commenting on was that this thread so far seemed overly critical given the results I was seeing.

To give you a personal example, I have never seen a Proteus model that I liked - they have all been bad. Except, interestingly enough, this version of Proteus in V4 for 2x upscale. This is by far the best Proteus model compared to previous versions - which is why I found the feedback unusual.

Having said that, I did some testing on the 4x upscale version - and it's very different. That is to say, this release says they updated the 1x, 2x, and 4x, but I normally don't use the 4x model. Testing with the 4x model in various instances, I can see significantly more smudginess where detail is lost compared to the 2x.

I also don't use a Mac, so I cannot report on the Mac version.

What really got me, however, is that I have spent countless hours trying to refine an acceptable process that uses the least amount of time. That is, if I am upscaling entire seasons of shows, taking two years to complete isn't viable.

This version of Proteus is the first where I got an acceptable upscale in the shortest time possible. A full 45-minute episode, processed and image-extracted through Avisynth, upscaled to 1080p (equivalent), reassembled, color corrected, and exported through Adobe Premiere - the whole process took less than three hours on my six-year-old PC with a 1080 GPU.

Obviously not perfect:

  • There are strange artifacts around the eyes, particularly as they get further from the camera. I assume this is the dialed-back facial algorithm trying to enhance them, but it doesn't look quite right.
  • There is a grid-like pattern that is noticeable in some areas.
  • There are some strange artifacts in various locations, unfortunately most noticeable on faces - but again, primarily on faces that are further from the camera.
  • Frame number selection is still not accurate for exporting clips.

For this test footage I only spent an hour or so experimenting with settings, then a total of three hours for the full process - I think it's pretty good compared to anything V3 was spitting out.

But then this is for the 2x only. With the same settings but switching to the 4x upscale, the results are significantly worse. As a comparison, this is an image slider between the exact same frame, same settings, on V4, with 2x compared to 4x:

You would be forgiven for thinking they came from two different models. There are also a lot more artifacts in the 4x output - though I am not sure if that means the 2x is in a better state, or if they haven't updated 2x to match 4x. Either way, I very much hope they are pushing 4x more in the direction of 2x than the other way around.

@suraj

Any update on the AMD Matrix Core front, Radeon ML?

Since TVAI is not the only app where a 4090 would be 2x faster than my W6800, I am considering leaving AMD behind in terms of GPUs.

@nipun.nath

Topaz Video AI Beta  v3.2.7.0.b
System Information
OS: Mac v13.0301
CPU: Intel(R) Core(TM) i5-7600K CPU @ 3.80GHz  32 GB
GPU: AMD Radeon RX 480  8 GB
GPU: AMD Radeon Pro 580  8 GB
Processing Settings
device: 2 vram: 1 instances: 1
Input Resolution: 480x360
Benchmark Results
Artemis		1X: 	13.35 fps 	2X: 	08.69 fps 	4X: 	03.16 fps 	
Proteus		1X: 	13.88 fps 	2X: 	08.62 fps 	4X: 	03.12 fps 	
Gaia		1X: 	06.13 fps 	2X: 	04.24 fps 	4X: 	03.00 fps 	
4X Slowmo		Apollo: 	12.06 fps 	APFast: 	43.31 fps 	Chronos: 	02.06 fps 	CHFast: 	04.20 fps 	

Nvidia mostly requires Windows. AMD works on Apple and Windows. M1+ is Apple only.

Topaz Video AI Beta  v3.2.7.0.b
System Information
OS: Mac v13.0301
CPU: Intel(R) Core(TM) i5-7600K CPU @ 3.80GHz  32 GB
GPU: AMD Radeon RX 480  8 GB
GPU: AMD Radeon Pro 580  8 GB
Processing Settings
device: 2 vram: 1 instances: 1
Input Resolution: 1920x1080
Benchmark Results
Artemis		1X: 	02.83 fps 	2X: 	01.78 fps 	4X: 	00.53 fps 	
Proteus		1X: 	02.80 fps 	2X: 	01.70 fps 	4X: 	00NaN fps 	
Gaia		1X: 	01.06 fps 	2X: 	00.76 fps 	4X: 	00.57 fps 	
4X Slowmo		Apollo: 	02.19 fps 	APFast: 	07.82 fps 	Chronos: 	00.42 fps 	CHFast: 	01.07 fps 	

Look at the 00NaN number.

The W6800 is RDNA2 - I was under the impression that RDNA2 has no vector matrix units and RDNA3 does - or have I missed something?

Two GCN4 cards - another example of the use case I presented a few posts earlier (GCN3 in that case)… Not sure about the outcome on macOS, but under Windows a performance increase could be achieved by running these with FP16 models and bigger tile sizes (see my post - my two Fury X cards achieved more than double your speed after manually patching TVAI's models…).
@suraj I am not sure how the Mac version handles GCN4 cards - are FP16 models loaded there?

Ah, another thing I spotted: you have "instances: 1" in your log - AFAIK this correlates to the number of threads/streams thrown at the card (I might be mistaken on this one) - but I did see that turning the number of processes up by one reduces the number of simultaneous threads given to each card, which actually helped performance… Oddly, the behaviour was not 100% reproducible and I did not have time to investigate further. Give it a try - I would love to hear how your system behaves when doing so.
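If you want to experiment with that idea outside the app, here is a rough sketch of the "more processes, each handling less per card" pattern - process_frames() is only a stub, and the model file and device indices are hypothetical:

```python
# Rough sketch of the "more processes per card" experiment: two worker
# processes, each pinned to one DirectML adapter, each handling half of the
# frames. The model file, device indices and the processing loop are stubs.
from concurrent.futures import ProcessPoolExecutor
import onnxruntime as ort

def process_frames(device_id, frame_ids):
    sess = ort.InferenceSession(
        "model_fp16.onnx",
        providers=[("DmlExecutionProvider", {"device_id": device_id})],
    )
    # ... load each frame, tile it, call sess.run(...), write the result ...
    return f"device {device_id}: {len(frame_ids)} frames done"

if __name__ == "__main__":
    frames = list(range(1000))
    with ProcessPoolExecutor(max_workers=2) as pool:
        jobs = [pool.submit(process_frames, 1, frames[::2]),
                pool.submit(process_frames, 2, frames[1::2])]
        for job in jobs:
            print(job.result())
```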

Yes, but there are some Radeon libraries in the making at AMD that may increase performance even on RDNA 2 GPUs.

In theory, the W6800 should be 50% faster than a Quadro RTX 5000, not just on par with it.

AMD slept through the A.I. game (until February of this year) and didn't care much about software support.

Nvidia, on the other hand, has been working on it since 2016, and now that is paying off.

GCN would be nice for A.I., but your screen would stutter, because GCN prefers compute, and since ONNX here goes through DirectML, which is DirectX, your display pipeline is blocked the whole time.

I had a Radeon Pro VII in 2020 for testing, and while exporting images via DeNoise AI the display stuttered.

In contrast, an RTX 3090 didn't care what I dropped on it.

This sits on top of DirectML or OpenCL, and neither is the fastest… The ONNX implementation is faster and already available in TVAI.
Regarding speeds: while the pure scalar shader performance of the W6800 in FP16 is better than the RTX 5000's, the tensor cores can deliver around 85 TFLOPS in FP16 - much faster than the W6800. AMD has always had better scalar performance, but tensor cores running inferences outperform it by a big margin… So unless I have missed a hidden vector matrix unit in RDNA2, it won't be able to keep up with most tensor-core-equipped cards… RDNA3 and Arc have similar units built in…


Compute mode on GCN helps in this regard, but you are right, scheduling is a thing… Same with Nvidia in some cases (but as you mentioned, they addressed the compute department for mixed display/compute scenarios earlier, so they manage it better in most cases).

GCN architectures - while very nice for scalar computation back then (I love my GCN cards… aging like fine wine…) - sadly also have no vector matrix units… And while GCN compute mode was still available in RDNA1, none of this helps against tensor cores and the like…

I am always impressed by the compute speeds older GCN cards can reach compared to Nvidia equivalents of the same era - and I remember the times when tensor core support in TVAI was not optimal… Vega GPUs destroyed even midrange RTX cards. But now that tensor core support has matured, these older compute monsters don't stand a chance…

I do hope support for Intel and AMD vector matrix cores will get better, too.


Yeah, Pascal was really weak in compute.

Intel is already very good, I think, but only has midrange performance compared to the last-gen GPUs, and the software still needs to mature.

AMD is in the game but didn't really care about it until ChatGPT became a thing, so maybe we will see a miracle when the Pro GPUs are out, or we will have to wait until RDNA 4.

oneAPI (Intel Arc/Xe) handles compute reasonably well; in my benchmarks with different inferences it is completely fine for the money. In Stable Diffusion it is roughly on par with a 3060 to 3070, and in Vulkan compute sometimes even a little better. The Xe cores are not fully supported in Vulkan compute at the moment, but it is already quite nice. Blender lacks RT core support, but that is in the making and should come with 3.6 or 3.7 - and the recent oneAPI port of Cycles is roughly what a midrange RTX card can do in CUDA… Not too bad for starters. In TVAI it still needs optimization; the cards could run faster (manually tweaking the loads reveals good performance, rivaling midrange RTX cards for an A770).

AMD has been in the compute game for a very long time; many supercomputers, data centers, and cloud services have been using AMD for years… The difference to Nvidia is that they aimed things like ROCm only at those enterprise/science customers and did not care to bring it to the general desktop user/gamer/content creator/enthusiast…

So every science student who used a Quadro/Tesla at work could simply test things out and keep working on their personal gaming rig when using Nvidia…

If one gets AMD cards running, they really are not bad at computation. I just recently acquired some 16 GB Vega enterprise cards that aren't too bad for heavy FP16 scalar workloads (BOINC the sh…t out of them), but to get them running I need to patch the kernel, swap something here, patch ROCm there, etc… AMD really should be aware that the software needs to be "iPhone like" - they need to make it accessible and usable for the average Joe…

We will see a lot more inference accelerators in the years to come. The ARM consortium has many players that already implement accelerators or are planning to, Google and others have had specialized "AI accelerators" for years, Intel is adding more and more to its CPUs (the newest Xeons are so good at inference that you can perfectly well do inference work without any GPU if you don't need maximum performance…), etc…

At the moment it's a jungle - ONNX, DirectML, TensorRT, oneAPI, Metal… Topaz is really doing a good job throwing all of these under the same hood - let's hope the market clears up over time and makes implementations easier…


According to AMD, DirectML is still much faster than their native libraries. They will let me know when that changes for our models, and we will add it to the aiengine.


Bruh!

oh man.

thx for the update.

On Mac we only run FP16 models. As for the FP16 vs FP32 comparison on older AMD GPUs under DirectML, last I checked it was still very slow. This was with higher input sizes for the models to reduce the number of tiles. The other advantage of FP16 models is their ability to run on GPUs with smaller VRAM, but the performance was so slow that in some cases the CPU was faster.
Currently there is no way to run FP16 models on older AMD GPUs; let me know if you want to try it out, and I can get you a version that will let you experiment.


thanx :slight_smile: I wrote you a message :slight_smile:

I got a notification that this post had been moved - I think you took mine by accident; it has nothing to do with this thread and is now weirdly out of place.