Thanks for getting back.
It seems I did a bad job of getting my point across.
GCN1 and 2 are indeed very slow in FP16. GCN3 and 4 introduced FP16 at the same speed as FP32. AMD even advertised Fiji cards explicitly for FP16 workloads (the keynote developer demo ran object-recognition inference on them)…
I explicitly checked FP16 performance on GCN3 and 4 in several scenarios. clpeak measures FP16/32/64 scalar and vector throughput and shows exactly the same performance for full and half precision. Tweaking the JSON files in TVAI to force FP16 usage also shows the same or even slightly higher performance with the FP16 models loaded, compared to the FP32 ones. Last year's benchmark tool from you shows this as well, and Vulkan compute tests with different inference backends for Waifu2x, ESRGAN, etc. also prove that this works. I am at the office right now, so I don't have access to the screenshots of the results; one test in the recent TVAI beta is shown in my post. All the numbers seen there are results from running the FP16 models.
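For anyone who wants to sanity-check FP16 support on their own card without clpeak, here is a minimal sketch using pyopencl (my choice for illustration, any OpenCL binding works) that simply reports whether a device exposes the cl_khr_fp16 extension at all:

```python
import pyopencl as cl

# List every OpenCL device and whether it advertises the cl_khr_fp16
# extension, which half-precision kernels rely on.
for platform in cl.get_platforms():
    for device in platform.get_devices():
        has_fp16 = "cl_khr_fp16" in device.extensions
        print(f"{platform.name} / {device.name}: "
              f"FP16 extension {'present' if has_fp16 else 'missing'}")
```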
Since VRAM usage is halved, bigger tile sizes can be used, which, together with the shorter transfer times for the smaller models, improves FP16 performance even further. The raw compute power is no better in FP16 on GCN3 and 4, but the bigger tile sizes and faster transfers still help in terms of speed.
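To make the tile-size argument concrete, a back-of-the-envelope sketch (the numbers are illustrative only; real VRAM usage depends on the network's intermediate activations):

```python
# Rough estimate: a tile of H x W pixels with C channels costs
# H * W * C * bytes_per_value of memory per buffer.
def tile_mb(height, width, channels=3, bytes_per_value=4):
    return height * width * channels * bytes_per_value / 1024**2

print(f"FP32 1920x1080 tile: {tile_mb(1080, 1920, bytes_per_value=4):.1f} MB")
print(f"FP16 1920x1080 tile: {tile_mb(1080, 1920, bytes_per_value=2):.1f} MB")
# Half the bytes per value means roughly twice the tile area fits into the
# same VRAM budget, plus half the data to transfer per tile.
```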
I tested this intensively on several GCN3 cards (Tonga, Fiji) and multiple GCN4 cards (RX 470, RX 550, etc.) and it worked in every scenario. When I do single-image scaling with open-source image/video software that uses Vulkan/ONNX/DirectML, I always switch to the FP16 models.
I am completely aware that the focus nowadays is not on old hardware and that any matrix/vector unit like a tensor core gives much higher performance.
But since simply loading the FP16 models instead of the FP32 ones, either via a manual switch or by automatic detection of GCN3/4, should be very little work to implement, and given that there are still enormous numbers of users of these cards out there, I thought it was worth recommending.
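Purely as a hypothetical illustration of what such an automatic selection could look like (the chip list and the function are made up by me for this post, not TVAI code):

```python
# Hypothetical sketch: pick FP16 models on GCN3/4 cards, FP32 otherwise.
GCN3_4_CHIPS = ("Tonga", "Fiji", "Polaris", "Ellesmere", "Baffin", "Lexa")

def pick_model_precision(gpu_name: str, force_fp16: bool = False) -> str:
    """Return 'fp16' for GCN3/4 cards (or when forced), else 'fp32'."""
    if force_fp16 or any(chip.lower() in gpu_name.lower() for chip in GCN3_4_CHIPS):
        return "fp16"
    return "fp32"

print(pick_model_precision("AMD Radeon R9 Fury (Fiji)"))      # -> fp16
print(pick_model_precision("AMD Radeon RX 470 (Ellesmere)"))  # -> fp16
print(pick_model_precision("NVIDIA GeForce RTX 3060"))        # -> fp32
```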
The last piece of motivation to raise this topic again came from the fact that the achievable speeds are actually not bad at all, much better than one would anticipate from typical gaming benchmarks. Your ONNX optimizations work wonderfully, and after all, a Fiji GCN3 card can nowadays be bought for around 50 to 80 bucks and is able to deliver almost 9 TFLOPS of FP16 performance with the very high bandwidth of its 4096-bit HBM memory (~500 GB/s)… This clearly outperforms most Pascal cards, and on my test bench with two Fiji cards it actually gives better-than-realtime speeds for clean 1:1 passes on DVD material. A year ago we dreamed of speeds like this even on RTX cards.
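For reference, the headline number follows directly from Fiji's shader count and clock (Fury X figures; a Fury or Nano lands a bit lower), since GCN3 runs FP16 at the FP32 rate:

```python
# Peak throughput: shader ALUs x 2 FLOPs per cycle (FMA) x clock in GHz.
shaders, clock_ghz = 4096, 1.05          # Fiji (R9 Fury X) reference values
tflops = shaders * 2 * clock_ghz / 1000  # ~8.6 TFLOPS; FP16 = FP32 rate on GCN3
print(f"{tflops:.1f} TFLOPS")
```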
Of course I would not recommend that anyone doing a lot of TVAI work favour a card like this over a midrange RTX, but there are many people still using these cards, they still make decent FHD gaming cards, and fed with FP16, even FHD inputs become usable… Something like a Radeon Pro Duo with 2 x 8 or 16 GB then actually becomes usable and gets close to the speed of a small RTX card in some scenarios.
If the GPU selection were extended to something like "use cards 2 and 3 for computation", that would also help a lot in this scenario… Windows and background processes tend to waste VRAM randomly, so setting up a system where an iGPU (which essentially every Intel CPU of the last ten years has anyway) handles the display side of the desktop would leave two dedicated "compute cards" free for processing… At the moment we can only select either one card or all cards. If "all cards" is set, one of them always has to deal with the rest of the system, leading to performance losses, occasional VRAM shortages, etc…
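As an example of how such a selection could map to the underlying APIs: ONNX Runtime's DirectML execution provider can, as far as I know, be pinned to a specific adapter via its device_id option, so a sketch like this (the model path is a placeholder) would already cover the "use cards 2 and 3" case while device 0 stays on desktop duty:

```python
import onnxruntime as ort

# Sketch under the assumption that the DirectML EP's device_id option maps
# to the desired adapter ordering on the system.
def make_session(model_path: str, device_id: int) -> ort.InferenceSession:
    return ort.InferenceSession(
        model_path,
        providers=[("DmlExecutionProvider", {"device_id": device_id})],
    )

# One session per compute card, leaving device 0 (iGPU) for the desktop.
sessions = [make_session("model_fp16.onnx", dev) for dev in (1, 2)]
```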
I imagine that implementing a “select card to use” switch/selector could be done easily.
Thanks for taking the time to read all this.