Unfortunately I can’t install software on (and risk messing up) the host machine. It’s bare bones, with almost no software on it: just a kernel, Docker and the NVIDIA driver. The whole point of containers is to move variation and dependencies into immutable and reproducible objects, in order to simplify host management and solve the “works on my machine” problem.
This is exactly why I’m using Docker in the first place: to run different software that normally could not run on the same machine because of conflicting library version requirements. Putting all user-space libraries in a container is the whole point of containers: isolating every bespoke application requirement without having to buy a separate physical server for each application.
As for library version differences, I did build the image on 11.8 as well, with the exact library versions you listed for TensorRT, and got the same crashes at the same frequency. So I don’t expect the cause to be the newer versions.
But I can downgrade the packages again if it helps you rule out at least one axis of variance.
EDIT: Downgraded the base container to 11.8 to align with the TensorRT versions you said you use in-house, and pushed that change to GitHub as well.
I’m getting the same intermittent crashes with this downgrade, Gregory.
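For reference, the shape of that change is roughly the following. This is a sketch, not the literal Dockerfile from the repo; the base image tag is an example and the pinned versions match the listing below:

FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04

# Pin the runtime libraries to the 11.8-era versions shown further down,
# and hold them so a later apt upgrade can't silently move them.
RUN apt-get update && \
    apt-get install -y --allow-downgrades \
        libcudnn8=8.9.0.131-1+cuda11.8 \
        libnccl2=2.15.5-1+cuda11.8 && \
    apt-mark hold libcudnn8 libnccl2

(The "hi" status in the dpkg listing below is the hold + installed state that apt-mark hold produces.)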
The libraries of note included in the container are:
$ dpkg -l | grep cuda
ii cuda-compat-11-8 520.61.05-1 amd64 CUDA Compatibility Platform
ii cuda-cudart-11-8 11.8.89-1 amd64 CUDA Runtime native Libraries
ii cuda-keyring 1.0-1 all GPG keyring for the CUDA repository
ii cuda-libraries-11-8 11.8.0-1 amd64 CUDA Libraries 11.8 meta-package
ii cuda-nvrtc-11-8 11.8.89-1 amd64 NVRTC native runtime libraries
ii cuda-nvtx-11-8 11.8.86-1 amd64 NVIDIA Tools Extension
ii cuda-toolkit-11-8-config-common 11.8.89-1 all Common config package for CUDA Toolkit 11.8.
ii cuda-toolkit-11-config-common 11.8.89-1 all Common config package for CUDA Toolkit 11.
ii cuda-toolkit-config-common 12.1.105-1 all Common config package for CUDA Toolkit.
hi libcudnn8 8.9.0.131-1+cuda11.8 amd64 cuDNN runtime libraries
hi libnccl2 2.15.5-1+cuda11.8 amd64 NVIDIA Collective Communication Library (NCCL) Runtime
NVIDIA guarantees forward and backward compatibility within the same major version (11 in this case), so if we trust that company’s statement, the library versions should now be sorted. As for the driver itself, NVIDIA also guarantees that drivers are forward compatible: a CUDA application built against driver version N will keep working with any later driver version.
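For what it’s worth, this is the kind of check I’m relying on; the image tag here is just an example, not necessarily the exact one in my repo:

$ nvidia-smi --query-gpu=driver_version --format=csv,noheader      # driver the host exposes
$ docker run --rm --gpus all nvidia/cuda:11.8.0-runtime-ubuntu22.04 \
    sh -c 'dpkg -l | grep cuda-cudart'                             # CUDA runtime inside the container

As long as the host driver is new enough for CUDA 11.8, that combination is covered by the compatibility statement above.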
The exact libraries used by the Topaz ffmpeg binary are provided in the attached file.
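A list like that can be generated with something along these lines (the ffmpeg path is illustrative, not the actual install location):

$ ldd /path/to/topaz/ffmpeg | sort > ffmpeg-deps.txt   # resolved shared-library dependencies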
As you can see from that list, all the NVIDIA libraries are the ones you ship. That would lead one to believe it should be possible to just install your deb on a vanilla Ubuntu container. However, I tried that as well, and then got a 100% failure rate running any of the models. So clearly the NVIDIA container provides something that you do not ship in the deb package.
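My guess is that it’s the driver-side user-space libraries (libcuda, libnvidia-ml, and friends) that the NVIDIA Container Toolkit mounts in from the host at run time. Assuming the toolkit is also what provides --gpus on your side, this shows what it injects into every GPU container:

$ nvidia-container-cli list   # driver libraries and binaries the runtime mounts from the host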
I understand you have some RTX 4090s in your lab. Are you able to run the test on one of them a few times and see whether you also get crashes in the preflight phase?
ffmpeg-deps.txt (6.8 KB)