VEAI Performance

suraj · January 6, 2021, 12:02am

Factors influencing the performance of models in VEAI.
Gaia runs entirely on the GPU, so CPU performance will have very little impact.
Artemis and Theia use both the GPU (80-90%) and the CPU (10-20%), so their performance depends a lot more on the CPU.
All decoding/encoding is done on the CPU, so an older CPU will lead to slower performance unless you are using older codecs or uncompressed image formats like TIF.
PCIe bandwidth is very important when doing higher-resolution upscaling. If you are using a PCIe 4 compatible GPU, performance will vary a lot between an AMD 3rd gen or newer processor that supports PCIe 4 and any AMD/Intel processor/motherboard that does not.
Always ensure that you have an adequate power supply to power both the CPU and GPU at peak loads.

We try to use as many CPU cores as possible, but most of the code is optimized for 4-8 core processors, so more than 8 cores may not increase performance as expected. This will be addressed in the future.
Similarly, RAM usage is optimized for 16-32GB, and VEAI will not take advantage of more memory than that.
GPU VRAM optimizations are currently set at a MAX VRAM of around 12GB, so if you have a card with more than 12GB of VRAM, run multiple instances on the same GPU for better throughput.
Disk speed also matters, especially when reading/writing large uncompressed image/video files, but in most cases it will have minimal impact.
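
As a rough illustration of the VRAM point above, here is a minimal sketch that estimates how many instances it would take to cover a card's VRAM if a single instance tops out at around 12GB. The helper name `plan_instances` and its `cap_gb` parameter are hypothetical and not part of VEAI; real VRAM use depends on the model and the resolution.

```python
import math
from typing import Tuple


def plan_instances(vram_gb: float, cap_gb: float = 12.0) -> Tuple[int, float]:
    """Return (instance count, VRAM share per instance in GB).

    Hypothetical helper: assumes one instance uses at most `cap_gb` of VRAM,
    so a larger card needs several instances to be used fully.
    """
    if vram_gb <= 0 or cap_gb <= 0:
        raise ValueError("vram_gb and cap_gb must be positive")
    # Smallest number of instances whose combined cap covers the whole card.
    count = max(1, math.ceil(vram_gb / cap_gb))
    return count, vram_gb / count


if __name__ == "__main__":
    for card_gb in (8, 12, 16, 24):
        count, share = plan_instances(card_gb)
        print(f"{card_gb}GB card: {count} instance(s), about {share:.1f}GB each")
```

Under these assumptions a 16GB card comes out as two instances of 8GB each, and a 24GB card as two instances of 12GB each.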

For NVIDIA only:
Starting with version 1.7, VEAI does not use CUDA for processing and uses DirectML instead. So the actual GPU usage, when checked in Task Manager, will show up under Graphics_1.
GTX performance is about 10-20% lower compared to 1.6.1 and earlier versions due to a driver/library issue, which will be fixed in the near future.
RTX, with its FP16 tensor core support, performs 2-3x better than CUDA via TensorFlow. While the entire model is able to take advantage of FP16 processing, there are certain layers in the models that also take advantage of hardware-level tensor core kernels to get up to a 20x speedup. In VEAI these layers are very few, so the overall performance improvement will still be in the 2-3x range.
RTX 3000 series performance is still being analyzed and optimized; it will take some months before superior performance can be achieved on the higher-core 3080 and 3090 GPUs.
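
To illustrate why a few layers with a 20x speedup still leave the overall gain in the 2-3x range, here is a minimal sketch based on Amdahl's law. The 2x base speedup and the 20x figure come from the RTX point above; the 10% share of runtime spent in the tensor-core layers is purely an assumed number for illustration, and `overall_speedup` is a hypothetical helper, not VEAI code.

```python
def overall_speedup(base: float, fast_fraction: float, fast: float) -> float:
    """Combined speedup when part of the work is accelerated more than the rest.

    base          -- speedup applied to the ordinary layers (e.g. 2x from FP16)
    fast_fraction -- share of the ORIGINAL runtime spent in the special layers
    fast          -- speedup of those special layers (e.g. 20x)
    """
    if not 0.0 <= fast_fraction <= 1.0:
        raise ValueError("fast_fraction must be between 0 and 1")
    if base <= 0 or fast <= 0:
        raise ValueError("speedups must be positive")
    # New runtime, relative to an original runtime of 1.0.
    new_time = (1.0 - fast_fraction) / base + fast_fraction / fast
    return 1.0 / new_time


if __name__ == "__main__":
    # Assumed: 10% of the original runtime is in the tensor-core layers.
    print(f"{overall_speedup(2.0, 0.10, 20.0):.2f}x")  # about 2.2x
```

With these assumed numbers the result stays close to the base 2x, because the 20x layers only account for a small part of the total time.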

If you have further questions, post them here.

shikuteshi · January 6, 2021, 12:35am

Thanks a lot for this information!

Edit:
A suggestion maybe?
VEAI doesn’t make use of more than 12GB of VRAM. In this case, instead of depending on the user to be aware of this and manually set up a 2nd instance, would it be that much trouble to have VEAI automatically run 2 instances itself if it detects this? Let’s say we have 16GB of VRAM like the latest AMD cards: automatically split it into 8GB instances and split the load between the first half and second half of a given video?
You can’t do much else with that card anyways while it’s under load.
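
The "first half and second half" idea above could be sketched roughly like this: divide the frame range into contiguous chunks, one per instance. This is only an illustration of the splitting step; `split_frames` is a hypothetical helper, and VEAI does not expose anything like it.

```python
from typing import List


def split_frames(total_frames: int, parts: int) -> List[range]:
    """Split frames 0..total_frames-1 into `parts` contiguous chunks.

    Chunk sizes differ by at most one frame, and together the chunks cover
    every frame exactly once.
    """
    if total_frames < 0:
        raise ValueError("total_frames must not be negative")
    if parts < 1:
        raise ValueError("parts must be at least 1")
    base, extra = divmod(total_frames, parts)
    chunks = []
    start = 0
    for index in range(parts):
        # The first `extra` chunks take one additional frame each.
        size = base + (1 if index < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


if __name__ == "__main__":
    for chunk in split_frames(1001, 2):
        print(f"frames {chunk.start}..{chunk.stop - 1} ({len(chunk)} frames)")
```

Each instance would then process its own chunk, and the outputs would still have to be joined afterwards, which this sketch does not cover.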

suraj · January 6, 2021, 1:30am

Hopefully in the future the Max VRAM slider will be per GPU rather than the way it is right now, so you will be able to set exactly the maximum VRAM in GB for your specific GPU. This should also handle situations with higher-VRAM GPUs, and at that point one instance would be able to take advantage of the entire GPU.
There will still be situations, such as upscaling from 720p or lower resolutions by 100-200%, where VRAM usage will be low, and that is as intended. Just like the program doesn’t take over all the RAM, it will only use the required amount of VRAM.
I would recommend using GPU utilization as the metric to judge whether the GPU is being used completely. Currently there are a few performance bottlenecks between frames that lead to small dips in GPU usage; this will be addressed soon and should result in a minor performance boost.
For now VEAI is UI driven, with an input/output display every frame. Once the UI is redone, maybe the video will be split into chunks and processed in parallel without having to show the current frame in the UI.

Len_Nicholson · January 7, 2021, 2:05am

One thing that I’m not sure is a weakness or something that is parameterised somehow.

I can upscale the video but the audio is always dumbed down to 96kb/s, terrible when high quality audio is in the original.

I don’t understand why the default is no audio anyway, but is there any way to maintain audio quality i.e. not having it reduced to 96kb/s?

imthemarmar · January 7, 2021, 2:29pm

Can I ask a silly question?

Why the move away from CUDA and towards DirectML? I mean, for those of us out here on minimalist systems using the program with lower-end GTX cards, this seems counterproductive. Losing 10-20% on a GTX card that can sometimes hit 4-5 seconds per frame renders it utterly useless. It seems like a sure way to alienate a good segment of users who will not or cannot afford to spend $1000 on a video card.

Version 1.6.1 runs just fine on CUDA cores. I see that DirectML can utilize CUDA cores. Is this something you plan on implementing for your users who have lower end cards and rely on their core count for their processing? I just fear that the precision of the processing that CUDA affords is being tossed for speed.

reiner · January 7, 2021, 5:20pm

suraj, thanks for this useful information!

We talked about HW decoding of h264/h265/JPEG/etc. via the GPU, but above you state that all de/encoding is done by the CPU. Are you planning to implement HW decoding?

Could you give some insight into which parts of the processing are done by the CPU in the case of Artemis and Theia?

Regards,
Reiner

suraj · January 7, 2021, 5:20pm

I do feel your pain; I have a 980 Ti and a 1080 in my personal machines.
Like I mentioned above, the 10-20% loss in performance on GTX GPUs is due to a Microsoft/Nvidia issue that will be resolved soon. That should result in faster performance than the pre-1.7.0 releases, even for GTX GPUs.
In our experience, cuDNN (the CUDA implementation of deep learning) is slower than DirectML (the Direct3D/DirectX implementation of deep learning).
With DirectML we can support AMD GPUs, RTX GPUs are 2-5x faster, and the RTX 3000 series is supported, as are older Nvidia GPUs which don’t have high CUDA compute. It also offers faster model load times than the older CUDA version (1-2s vs > 10s), better stability, lower VRAM usage so users can run other graphics-intensive programs without crashing, etc. It is a long list of features we like about DirectML.
The only drawbacks are a 10-20% loss in performance on GTX GPUs, caused by the bug, and lack of support for older Windows versions.

Unfortunately, since bigger companies are involved and it is a performance issue on older hardware, it is taking longer to resolve.

reiner · January 7, 2021, 5:33pm

DirectML has so much more potential than “CUDA” only. I am so glad you guys switched. Together with ONNX, the possibilities are so much better (I think my old Tesla now has no purpose anymore; I doubt I will get it to run now, but hey… it was slow as hell compared to recent GPUs).
If I find time, I will try to get my Radeon 64 to do TensorFlow - OPENML makes this possible.

kanav.singla · January 7, 2021, 7:08pm

Hi there,

How can I be your beta tester?

Anyway, I have an Intel i9-10980XE processor, an RTX 3090 24GB Founders Edition, and 128GB of DDR4 RAM at 3600MHz.

I don’t know why, but I could not post the image in the reply.

kanav.singla · January 7, 2021, 7:15pm

Nice one. I have 24GB of VRAM and it doesn’t even get utilised fully. I hope that they solve the issue soon.

shikuteshi · January 7, 2021, 7:39pm

I did get a warning about wanting to be on the Windows 10 2004 update rather than the 1909 version I’m on right now.
But updating puts me on version 20H2, which has so many issues for me.
So if the Windows version is part of the reason I’m seeing pretty poor performance on my GTX 1080, then that’s a little unfortunate. Hopefully it’s resolved soon, unless someone knows a way to update specifically to 2004 that I just haven’t been able to find.

vktopmanz · January 9, 2021, 1:44pm

This is very interesting. I haven’t seen any benchmarks that can bring out a significant difference between PCIe 3 vs 4 in respect to GPUs. Would be great to see benchmarks.

I’m also hoping 3080/3090 performance over the 2080 series can be improved, as all other workloads show a large difference.

mxrevolution · January 10, 2021, 6:35pm

From my own tests and benchmarks I know VEAI is not using the full capabilities of the new RTX 3090 yet.
I posted my detailed benchmarks a month ago here

Everyone can easily see the GPU load with NVIDIA GeForce Experience using Alt+R

Running one upscaling instance does not even use 50% of the RTX 3090 FE GPU.
I can easily run 2 instances of 4K upscaling and hardly reach 90% GPU load.
And if I only upscale to 1080p I can easily run 3 VEAI instances …

Gigapixel AI on the other hand … uses the gpu up to its limits.
I can hardly run VEAI and Gigapixel AI at the same time…
So there must be some potential of boosting VEAI with the RTX 30x0 series.

In the meantime, since you yourself recommended running more than one VEAI instance at the same time, it would be really helpful to change the “save.json” concept.

One instance always overwrites the “queue” of the other instances…
So I lost some big queues during VEAI crashes
(playing around with the crop function sometimes crashes VEAI)

In short… we need a dedicated queue for every instance,
like save1.json, save2.json, etc.
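
A per-instance queue file along these lines could be sketched as follows. This is only an illustration of the naming idea: the functions, the queue layout (a plain list of job dictionaries), and the instance numbering are all assumptions, and VEAI's real save.json format is not documented in this thread.

```python
import json
import os
from pathlib import Path
from typing import List


def queue_path(directory: str, instance_id: int) -> Path:
    """Build a per-instance file name such as save1.json or save2.json."""
    return Path(directory) / f"save{instance_id}.json"


def save_queue(directory: str, instance_id: int, queue: List[dict]) -> Path:
    """Write one instance's queue without touching the other instances' files."""
    path = queue_path(directory, instance_id)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(queue, indent=2), encoding="utf-8")
    # Swap the finished file into place so a crash mid-write keeps the old queue.
    os.replace(tmp, path)
    return path


def load_queue(directory: str, instance_id: int) -> List[dict]:
    """Read one instance's queue; a missing file means an empty queue."""
    path = queue_path(directory, instance_id)
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))
```

Because each instance only ever writes its own file, a crash in one instance would leave the queues of the others untouched.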

As I often use the crop function to render small parts of a video that are not really in the center,
it would also be great if the red cropping square were “moveable” …
or just the video under the red square… so I can center the part I want to render.
That gives me some kind of zoom effect. If the “object” is in any corner of the movie,
it’s almost impossible to do that.

Thanks for your attention

Protogullet · January 18, 2021, 6:55pm

Across three machines, I’m finding that my 6900 XT is roughly 2x faster than my 2070, and 3x faster than the 1080 Ti, without taking CPU into account. At some point, I’ll have to try exporting to uncompressed images to see how much that affects it; I generally output to CRF12 MP4 for additional processing.

The other benefit I’m seeing from the switch to DirectML is that it’s completely eliminated my issues with crashes, which happened consistently when attempting to export to MP4, but not anymore.

matt.caspermeyer · January 19, 2021, 4:08pm

As far as the UI showing every frame is concerned, there probably should be an option to turn the preview off during conversion so that performance is increased.

I’ve noticed that if I minimize VEAI, I get a noticeable performance improvement compared to leaving it up on the screen. This is on a Lenovo Flex 14 laptop with the Ryzen 4700U processor and an Acer Helios 300 with a GTX 1060.

When it comes to batch conversion of videos, typically, I just set it up, start it, minimize it, and then walk away from that machine since it is best to leave it alone while it is processing, I have found.

Louie · January 20, 2021, 12:35pm

I wish we had some sort of user spreadsheet for GPU and CPU use with benchmarking results. Info is scattered all over the forums and unless you spend a lot of time here, it can be hard to keep track.

I say this because I know there are a lot of users waiting for AMD and NVIDIA to get stock before making an upgrade.

Thanks for your input, every little bit helps.

Macadamia · January 20, 2021, 7:31pm

Indeed. I just pre-ordered an Asus Scar 17 laptop with a Ryzen 5900HX and an RTX 3080 (not exactly a Max-Q, as it can go up to 130W and has 16GB of GDDR6). I couldn’t wait forever for desktop parts availability, and a desktop with the same specs would have been much more expensive anyway.
I also like working on 17" laptops as I don’t have much space for all my PCs on my desk.
I know the mobile 3080 uses the desktop 3070 die, but at least it has 16GB and is supposed to be much faster than a desktop 2080 Ti.
I am supposed to receive the machine between January 28th and February 5th, and will then do some benchmarks with VEAI. If it is good enough, it will become my dedicated VEAI machine.

rdjordjevich · January 24, 2021, 10:59am

I finally got a new machine with a Ryzen 5800X and an RTX 3090. Previously I had been trying VEAI on a laptop with a mobile 2080. Performance is definitely faster, but I’m noticing some behavior that seems “weird.”

Enlarging a 480p video to 4K gives me 0.2 seconds per frame, but it actually seems to process a burst of 8 frames at that speed and then pause briefly (0.5 seconds, or about the time of 2 frames at the speed it’s been going) before another 8 frames get processed. The estimated time to complete a video fluctuates by 20%. The video card seems to make a bit of noise with each frame, which seems like coil whine? The noise is not as loud when I have VEAI minimized. Hard to tell exactly what’s going on.
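
To put rough numbers on the burst-and-pause pattern described above, here is a small sketch that averages the pause over each burst. It assumes the reported 0.2 seconds per frame does not already include the pause, which may not be how VEAI measures it; `effective_seconds_per_frame` is a hypothetical helper.

```python
def effective_seconds_per_frame(
    burst_frames: int, seconds_per_frame: float, pause_seconds: float
) -> float:
    """Average time per frame when every burst is followed by a fixed pause."""
    if burst_frames < 1:
        raise ValueError("burst_frames must be at least 1")
    if seconds_per_frame < 0 or pause_seconds < 0:
        raise ValueError("times must not be negative")
    # Total time for one burst plus its pause, spread over the burst's frames.
    return (burst_frames * seconds_per_frame + pause_seconds) / burst_frames


if __name__ == "__main__":
    average = effective_seconds_per_frame(8, 0.2, 0.5)
    slowdown = average / 0.2 - 1.0
    print(f"{average:.4f} s/frame, {slowdown:.0%} slower than the burst rate")
```

With the figures from the post (8 frames at 0.2 seconds each, then a 0.5 second pause), the average works out to about 0.26 seconds per frame, roughly 31% slower than the burst rate.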

Anyone else experience this? Any ideas on what I can do to optimize this? System is brand new and just built so I might not have something configured right.

vktopmanz · January 25, 2021, 7:23pm

@rdjordjevich sometimes VEAI can be inconsistent in terms of performance. Usually a fresh system restart will get VEAI to utilise the GPU fully and consistently. Other tricks include closing down the NVDisplay.Container.exe process(es) running in the background, and the same goes for dwm.exe. If all is well, the seconds per frame should be quite consistent. I think some apps can hang onto GPU resources without releasing them, which impacts VEAI performance.