PCIe bandwidth and AI models

Alpheratz · May 29, 2024, 4:38pm

Dear community / experts

Are GPU AI models (my main use will be 2K-4K upscaling / denoising / sharpening) PCIe bandwidth intensive?

I am planning to use 4 of my 5 PCIe x16 Gen3 slots.
The best I can get simultaneously is:

16 lanes Gen3
16 lanes Gen3
8 lanes Gen3
8 lanes Gen3

With this setup I would like to avoid a bandwidth bottleneck.
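For reference, the theoretical ceiling of each slot in the list above can be estimated from the PCIe line rate and encoding. This is only a rough sketch: the function names are made up for illustration, and real-world throughput is lower because of protocol overhead.

```python
# Theoretical one-direction PCIe bandwidth, derived from line rate and encoding.
# Gen2 uses 8b/10b encoding at 5 GT/s, Gen3 uses 128b/130b at 8 GT/s.
# Function names are hypothetical; real throughput is lower than these figures.
ENCODING = {
    2: (5.0, 8 / 10),     # generation -> (GT/s per lane, encoding efficiency)
    3: (8.0, 128 / 130),
}


def lane_mb_per_s(gen):
    """Usable MB/s for a single lane of the given PCIe generation."""
    gt_per_s, efficiency = ENCODING[gen]
    return gt_per_s * 1e9 * efficiency / 8 / 1e6


def slot_gb_per_s(gen, lanes):
    """Usable GB/s for a slot with the given number of lanes."""
    return lane_mb_per_s(gen) * lanes / 1000


if __name__ == "__main__":
    # The x16 / x16 / x8 / x8 Gen3 layout described in the post.
    for lanes in (16, 16, 8, 8):
        print(f"Gen3 x{lanes}: {slot_gb_per_s(3, lanes):.2f} GB/s per direction")
```

Whether an upscaling workload ever gets close to these limits depends on how much frame data actually crosses the bus, which this sketch does not measure.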

Also, why are 3 or 4 GPUs not “recommended”?
If this feature exists, it is meant to be used; otherwise, why not disable it?! I am a bit confused.

Regards

TPX · May 29, 2024, 4:42pm

No, they are GPU intensive, but your CPU can slow down the GPU depending on the settings and output format.

It's not about GPUs, it's about how many videos are rendered at the same time.

Alpheratz · May 29, 2024, 4:58pm

Perfect !

Sorry for misreading the label (again).

My goal is to render one video after another, with the best quality in a reasonable time. I am upscaling ****ing 1080p that looks horribly blurry on my screen (even with Sony Reality Creation doing its best, that’s not enough for me).

Currently I stack 3 AI models and my 2x 5700 XT can't keep up.
I will go for 4x RTX 3070 then.

Thank you

TPX · May 29, 2024, 5:40pm

Adding a second GPU only adds 25% in speed at the moment.

It's not clear to me how that will change in the near future.
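To put the 25% figure in perspective, here is a small sketch that fits Amdahl's law to it. Amdahl's law is a generic scaling model brought in for illustration, not a description of how Video AI actually schedules work across GPUs, so the extrapolation to more GPUs is only a guess under that assumption.

```python
# Amdahl's law fitted to the "second GPU adds 25%" figure from this thread.
# Assumption: the workload behaves like a fixed serial part plus a perfectly
# parallel part. This is a generic model, not a statement about TVAI internals.

def parallel_fraction(speedup, n):
    """Parallel fraction p implied by an observed speedup on n devices."""
    return (1 - 1 / speedup) / (1 - 1 / n)


def amdahl_speedup(p, n):
    """Speedup predicted by Amdahl's law for parallel fraction p on n devices."""
    return 1 / ((1 - p) + p / n)


if __name__ == "__main__":
    p = parallel_fraction(1.25, 2)  # 2 GPUs giving 1.25x the speed of 1 GPU
    print(f"implied parallel fraction: {p:.2f}")
    for n in (2, 3, 4, 5):
        print(f"{n} GPUs -> predicted speedup {amdahl_speedup(p, n):.2f}x")
```

If the model holds, each extra GPU brings less than the previous one, which would be consistent with the poor per-GPU load reported with many cards.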

Alpheratz · May 29, 2024, 5:42pm

Omg !

Changing that is a must-have.

TPX · May 29, 2024, 5:48pm

From the recent release thread of Video AI.

Alpheratz · May 29, 2024, 5:55pm

Will give it a try asap

What about RAM impact ?
I read :

But in practice, for my needs (upscaling / denoising 1080p), I never go above 16 GB used.
So ?

Alpheratz · May 30, 2024, 6:01pm

I have been running some tests with 4 and 5 RTX 3070s, and things are not going very well…

5x MSI RTX 3070 Gaming X Trio
➔ Unable to run benchmark
➔ Loads models perfectly, starts to encode
➔ GPU load 5x 68%, then “Error” after a few minutes
➔ Rock solid under Linux with the same setup

PCIe settings

Gen2 x4, BIOS set to 4x4x4x4, 1 lane ~ 5 GT/s / 500 MB/s
Gen2 x4, BIOS set to 4x4, 1 lane ~ 5 GT/s / 500 MB/s
Gen2 x4, 1 lane ~ 5 GT/s / 500 MB/s
Gen2 x4, BIOS set to 4x4x4x4, 1 lane ~ 5 GT/s / 500 MB/s
Gen2 x4, BIOS set to 4x4, 1 lane ~ 5 GT/s / 500 MB/s

4x MSI RTX 3070 Gaming X Trio

Gen3 x4, BIOS set to 4x4x4x4, 1 lane ~ 8 GT/s / 985 MB/s
Gen3 x4, BIOS set to 4x4, 1 lane ~ 8 GT/s / 985 MB/s
Gen3 x4, BIOS set to 4x4x4x4, 1 lane ~ 8 GT/s / 985 MB/s
Gen3 x4, BIOS set to 4x4, 1 lane ~ 8 GT/s / 985 MB/s

➔ Benchmarkable, results below

Topaz Video AI  v5.0.4
System Information
OS: Windows v11.22
CPU: AMD Ryzen Threadripper 2990WX 32-Core Processor  31.899 GB
GPU: NVIDIA GeForce RTX 3070  7.8301 GB
GPU: NVIDIA GeForce RTX 3070  7.8301 GB
GPU: NVIDIA GeForce RTX 3070  7.8301 GB
GPU: NVIDIA GeForce RTX 3070  7.8301 GB
Processing Settings
device: 4 vram: 1 instances: 0
Input Resolution: 1920x1080
Benchmark Results
Artemis		1X:	17.35 fps 	2X: 07.72 fps 	4X: 02.28 fps 	
Iris		1X:	26.20 fps 	2X: 11.73 fps 	4X: 03.10 fps 	
Proteus		1X:	25.96 fps 	2X: 11.72 fps 	4X: 03.25 fps 	
Gaia		1X:	15.41 fps 	2X: 09.20 fps 	4X: 02.96 fps 	
Nyx			1X:	06.63 fps 	2X: 08.91 fps 	
Nyx Fast		1X: 	 22.01 fps 	
4X Slowmo		Apollo:	 13.69 fps
		 		APFast:	 26.88 fps
 				Chronos: 18.60 fps
 				CHFast:  12.51 fps 	
16X Slowmo		Aion: 20.57 fps

Very poor GPU load / CPU fully loaded but no use of SMT / LAN bottleneck (X550-T2 coming in) / SSD bottleneck (Evo 970 dedicated to this workload coming in)

So… does any expert have an opinion about this?

Any PCIe bottleneck (explaining such poor GPU load)? ➔ In that case, I will buy PCIe x16 risers
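A rough way to sanity-check the PCIe question is to estimate how much raw frame data would have to cross the bus at the benchmarked frame rates and compare it with a Gen3 x4 link. The sketch below is a back-of-the-envelope estimate only. It assumes uncompressed RGB frames at 4 bytes per channel (float32), which is a pessimistic guess and not something documented for Video AI, and it sums upload and download even though PCIe carries each direction separately. The function names are made up for illustration.

```python
# Back-of-the-envelope PCIe traffic for 1080p input at the benchmarked fps.
# Assumptions (not documented for Video AI): uncompressed RGB frames,
# 4 bytes per channel, every input and output frame crosses the bus once.

def frame_bytes(width, height, channels=3, bytes_per_channel=4):
    """Size of one uncompressed frame in bytes."""
    return width * height * channels * bytes_per_channel


def traffic_gb_per_s(width, height, scale, fps, bytes_per_channel=4):
    """Upload (input frame) plus download (upscaled frame) traffic in GB/s."""
    upload = frame_bytes(width, height, bytes_per_channel=bytes_per_channel)
    download = frame_bytes(width * scale, height * scale,
                           bytes_per_channel=bytes_per_channel)
    return (upload + download) * fps / 1e9


# Gen3 x4 link: 4 lanes at ~985 MB/s each, as listed in the PCIe settings.
GEN3_X4_GB_PER_S = 4 * 0.985

if __name__ == "__main__":
    # Proteus figures taken from the benchmark results above.
    for label, scale, fps in [("Proteus 1X", 1, 25.96),
                              ("Proteus 2X", 2, 11.72),
                              ("Proteus 4X", 4, 3.25)]:
        need = traffic_gb_per_s(1920, 1080, scale, fps)
        share = need / GEN3_X4_GB_PER_S
        print(f"{label}: ~{need:.2f} GB/s, {share:.0%} of one Gen3 x4 link")
```

Even with these pessimistic assumptions the estimate stays below the theoretical limit of a single Gen3 x4 link, and the benchmark fps is shared across 4 GPUs. That does not rule out a PCIe bottleneck, but it suggests looking at the CPU and the multi-GPU scheduling as well before buying risers.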

Any coding bottleneck with multiple GPUs?
➔ If so, will any new release handle it better?

Any models to improve?
➔ For sure, plenty of them need it. I ran many tests with many models, and some objects are handled very strangely: background faces (nose and eyes; there is no point trying to sharpen a background the director deliberately left out of focus, it needs to stay blurry), a tube with a contrasting tinted section, crossed blades of grass, ears of wheat, high-contrast leafy branches. There is also poor understanding of stars and space scenes, and of motion compensation.
On the other hand, it does an awesome job on bare trees, spoked wheels, text, faces, denoising (it really kicks ass) and deinterlacing (even badly interlaced material is fixed perfectly).

Moving now to 5.1 for some new tests.
Devs / admins, feel free to weigh in. I am ready to help (I want to get my money's worth).

Regards

TPX · May 30, 2024, 8:58pm

I think it's the processor, and of course the current design of TVAI’s multi-GPU computing.

Its cache architecture has to communicate through other processors to get data from RAM.

I have workloads where my 7950X (16 cores, Zen 4, 140 watts) was 60% faster than my 3960X (24 cores, Zen 2, 280 watts).

But if you want, you can search for BIOS optimizations on the AMD website (provided the board supports them); there are different BIOS settings for each architecture depending on the workload.

Alpheratz · May 30, 2024, 9:31pm

Thank you, I can believe it. I already had serious trouble when using SMT on Win11.
I am going to install the latest BIOS as a starting point and focus on what you pointed out.

EDIT: From what I’ve started reading, Threadripper suffers from the same disease as Ryzen: the sweet spot.
In fact the phenomenon seems to get worse with the number of dies, and the 2990WX has 4…

Tomorrow I will try to replace this horrible Corsair 3200 kit with the G.Skill 4000 kit and look for the 1600 sweet spot of the IF (Infinity Fabric).

I’m afraid I won’t be able to achieve it because the X399 is not as docile as an X570, or only with a lot of testing with DRAM Calc. and by putting tons of volts into it.

Have you made any attempts in this direction?

(Future burning atmosphere in sight :S)