BF16 kernels and 5th‑Gen Tensor Core utilization in Topaz Video

Hi Topaz Labs,

I would like to propose adding BF16 inference kernels and full support for Nvidia 5th‑generation Tensor Cores (Blackwell, as used in the RTX 50 series) in Topaz Video.

Today, the RTX 50‑series GPUs expose extremely high AI throughput (FP16/BF16/FP8/INT4), but Topaz Video appears to rely mainly on FP16 kernels optimized for Ampere. Because Blackwell introduces:

  • new MMA tile sizes;
  • new warp‑level scheduling rules;
  • new shared‑memory layouts;
  • BF16 fast paths that keep FP32's 8‑bit exponent range (at reduced mantissa precision);
  • FP8/INT4 tensor pipelines;
  • higher SM occupancy requirements;

the existing kernels cannot fully saturate the hardware.
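
To illustrate the BF16 point above, here is a minimal sketch in plain Python (standard library only) that compares how FP16 and BF16 round the same values. The helper names to_bf16 and to_fp16 are my own for illustration and have nothing to do with Topaz's code; the BF16 helper assumes finite inputs.

```python
import struct

def to_bf16(x: float) -> float:
    """Round a finite value to bfloat16 (round-to-nearest-even), returned as a float.

    bfloat16 is the top 16 bits of an IEEE float32: 1 sign bit,
    8 exponent bits (same range as FP32) and 7 mantissa bits.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even on the 16 low bits that are about to be dropped.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def to_fp16(x: float) -> float:
    """Round a value to IEEE half precision; out-of-range values become infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the FP16 range (max 65504).
        return float("inf") if x > 0 else float("-inf")

if __name__ == "__main__":
    for value in (1e5, 1.001):
        print(value, "fp16:", to_fp16(value), "bf16:", to_bf16(value))
```

With these helpers, 1e5 overflows to infinity in FP16 but stays finite in BF16, while 1.001 keeps part of its fraction in FP16 and collapses to 1.0 in BF16. That is the trade-off I have in mind: BF16 gives more headroom against overflow in intermediate activations, at the cost of precision.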

Adding BF16 support and retuning the kernels for Blackwell could allow:

  • higher Tensor Core occupancy;
  • reduced memory bandwidth pressure;
  • larger tile sizes;
  • lower VRAM fragmentation;
  • better CPU↔GPU overlap;
  • significantly higher throughput on RTX 50 GPUs.

This would be especially beneficial for Proteus and Artemis, which in my experience appear to be CPU‑bound or memory‑bound on high‑end GPUs.

Thank you for considering this request. I think it would unlock a substantial amount of performance on modern Nvidia hardware.

Best regards,
Vincent