TVAI engine crashes ~4/5 times on Linux (3.4.4.0.b)

It takes on average ~5 attempts to get tvai to launch using your provided ffmpeg binary on Ubuntu 22.04; roughly 4 out of 5 attempts crash. On 3.4.4.0.b the crash is silent, whereas earlier betas provided useful hints and stack traces.

Since the crash frequency is unchanged from earlier betas, and those betas produced a stack trace and error originating from CUDA with a GPU out-of-memory message, I suspect the issue is the same in 3.4.4.0.b.

To reproduce:

git clone https://github.com/jojje/vai-docker.git
make build
make benchmark   # (a number of times)

The following is the command used, and attached are two log files: one showing a crash example (they all look the same, crashing in “Preflight”) and one showing a successful run using the exact same command.

ffmpeg -f lavfi -i testsrc=duration=60:size=640x480:rate=30 -pix_fmt yuv420p \
-filter_complex "tvai_up=model=prob-3:scale=2:preblur=-0.6:noise=0:details=1:halo=0.03:blur=1:compression=0:estimate=20:blend=0.8:device=0:vram=1:instances=1" \
-f null -

Observations:

  • The crashes seem stochastic.
  • When a crash happens, subsequent attempts are more likely to crash as well.
  • When leveraging the GPU with other programs (like launching nvtop) the probability of a successful VAI launch is significantly increased.

Most of the relevant hardware details can be found in the logs, since the VAI engine logs them itself.
What isn’t logged, but may be useful additional information:

  • OS: Ubuntu 22.04
  • Kernel: 5.15.0-82-generic x86_64 (stock ubuntu kernel)
  • nVidia driver: 535.54.03

vai_3440b_linux_logs.tar.gz (8.5 KB)

Based on the Dockerfile you’re using, there may be some conflicts there. You will want to downgrade to CUDA 11.8. The bundled copy of nvinfer and cuBLAS is usually enough by itself, so you might not even need to base off the CUDA container if you’re able to get the GPU visible otherwise.
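For example, the base image swap could look something like this (the exact 11.8 tag below is just an illustration, not a requirement):

# Dockerfile: base off a CUDA 11.8 runtime image instead of a newer one
FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04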

I’m not sure how many people we have running this in Docker, so I’d also be curious whether you’re able to reproduce the issue on bare metal.

Unfortunately I can’t install software on, and risk messing up, the host machine. It’s bare bones with almost no software on it: just a kernel, docker and the nvidia driver. The whole point of containers is to move variation and dependencies into immutable, reproducible objects, in order to simplify host management and solve the “works on my machine :person_shrugging:” problem.

This is exactly why I’m using docker in the first place: to run software that normally couldn’t coexist on the same machine due to conflicting library version requirements. Sticking all user-space libraries in a container is what containers are for: isolating every bespoke application requirement without having to buy a separate physical server for each application.

As for library version differences, I did build the image on 11.8 as well, with the exact library versions you listed for TensorRT, and got the same crashes at the same frequency. So I don’t expect the cause to be the use of newer versions.

But I can downgrade the packages again if it helps you rule out one axis of variance at least.


EDIT: Downgraded the base container to 11.8, to align with the TensorRT versions you previously stated you use in-house. Pushed that change to github as well.

I’m getting the same intermittent crashes with this downgrade, Gregory :frowning:

The libraries of note included in the container are:

$ dpkg -l | grep cuda

ii  cuda-compat-11-8                       520.61.05-1                             amd64        CUDA Compatibility Platform
ii  cuda-cudart-11-8                       11.8.89-1                               amd64        CUDA Runtime native Libraries
ii  cuda-keyring                           1.0-1                                   all          GPG keyring for the CUDA repository
ii  cuda-libraries-11-8                    11.8.0-1                                amd64        CUDA Libraries 11.8 meta-package
ii  cuda-nvrtc-11-8                        11.8.89-1                               amd64        NVRTC native runtime libraries
ii  cuda-nvtx-11-8                         11.8.86-1                               amd64        NVIDIA Tools Extension
ii  cuda-toolkit-11-8-config-common        11.8.89-1                               all          Common config package for CUDA Toolkit 11.8.
ii  cuda-toolkit-11-config-common          11.8.89-1                               all          Common config package for CUDA Toolkit 11.
ii  cuda-toolkit-config-common             12.1.105-1                              all          Common config package for CUDA Toolkit.
hi  libcudnn8                              8.9.0.131-1+cuda11.8                    amd64        cuDNN runtime libraries
hi  libnccl2                               2.15.5-1+cuda11.8                       amd64        NVIDIA Collective Communication Library (NCCL) Runtime

nVidia guarantees forward and backward compatibility within the same major version (11 in this case), so if we trust that company’s statement, the library versions should now be sorted. As for the driver itself, nVidia also guarantees that all drivers are forward compatible, meaning a CUDA program compiled against driver version N will still work with any later driver version.
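As a sanity check on those compatibility claims, the driver and the CUDA user-space libraries can be compared from inside the container (assuming the nvidia runtime exposes nvidia-smi, as it does in my setup):

$ nvidia-smi --query-gpu=driver_version --format=csv,noheader   # host driver as seen from the container
$ ldconfig -p | grep -E 'libcudart|libcublas|libnvrtc'          # CUDA user-space libraries registered with the linker in the container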

The exact libraries used by the topaz ffmpeg are provided in the attached file.
As you can see, all the nvidia stuff is what you ship. That would lead one to believe it should be possible to just install your deb on a vanilla ubuntu container. However, I tried that as well and got a 100% failure rate trying to run any of the models. So clearly there is something the nVidia container provides that you do not ship in the deb package.

I understand you have some RTX 4090s in your lab. Are you able to run the test on one of them a few times and see if you also get crashes in the preflight phase?

ffmpeg-deps.txt (6.8 KB)

Did another test using ubuntu:22.04 as the base image, in order to get you a complete set of information on why this doesn’t work.

As you can see from the log, the engine has trouble unzipping the encrypted zip files (tz and tpz) for some reason. Perhaps there is some library dependency you’ve forgotten to advertise in your deb package. As a result the engine iterates over all possible permutations of the prob3 model, but fails each time to initialize the ONNX runtime (not surprising, since it doesn’t have a valid model).

The final bit is interesting though, and I can’t for the life of me come up with even a hypothesis: it somehow manages to load an fp32 model and run it, extremely slowly, on the CPU. The surprising part is where the heck it got that model data from if it’s unable to unzip any tz file.

Anyway, this is why I went with the nVidia-blessed containers: they’re the de facto way to run ML workloads on Linux, and I actually got your engine to launch on the GPU a few times :slight_smile:

ubuntu-22.04-base.log.tar.gz (7.3 KB)


EDIT: forgot to address one misconception in your previous answer:

Docker is bare metal. It’s a tar file unpacked as a filesystem, like chroot. During execution, processes are ordinary linux processes running directly on the host kernel. The isolation is provided by linux control groups (cgroups) and linux namespaces. You can use those primitives on the command line without docker to achieve the exact same effect; docker is just a glorified command wrapper that runs the equivalent of a few commands before exec:ing a process. tl;dr: no virtualization; these are native linux processes accessing the native linux kernel exactly like all other processes on a gnu/linux system.
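To make that concrete, the namespace half of this can be poked at directly with stock Linux tools, no docker involved (a rough illustration only, not a complete container):

$ sudo unshare --pid --fork --mount-proc --uts /bin/bash   # new PID, mount and UTS namespaces, like a container gets
$ ps aux                  # inside: a private /proc, so only this shell and ps are visible
$ hostname isolated-demo  # hostname changes stay local to the namespace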

Thanks for the thorough testing.

For the regular ubuntu container it looks like there isn’t any issue with unzipping the models - the problem appears to be happening when attempting to initialize the CUDA runtime. It looks like the nvidia GPU isn’t being detected inside the container, so the initialization failure isn’t too surprising; can you verify that it shows up when running nvidia-smi inside? The stock ubuntu container might not be using the nvidia runtime (--runtime=nvidia --gpus all somewhere in your docker args) by default.
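For example (assuming the nvidia container toolkit is installed on the host):

$ docker run --rm --gpus all ubuntu:22.04 nvidia-smi                                      # toolkit injects nvidia-smi; the GPU should be listed
$ docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all ubuntu:22.04 nvidia-smi  # older nvidia-docker2 style invocation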

I’m aware that Docker containers aren’t VMs, but there can be issues, especially when we want the container to interact with GPUs. Reproducing the issue outside of a container is helpful for ruling those out. I’ve been able to reproduce the issue on both a VM and one of our physical machines here, so we’ll be looking into it. In the meantime, does the 3.3.9 beta still work for you? I’ve also noticed the issue seems to occur less often when using the GUI, so if you’re able to access that it might be worth a shot. I know you’re on a headless system, but something with X forwarding might be doable?

Based on our investigation so far, the issue seems to be related to the prap-3 model. As a workaround, you may have better luck if you remove the estimate=20 parameter and provide manual parameters instead, or try another model.
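Using your earlier benchmark command as an example, that would just mean dropping estimate=20 and keeping the rest of your manual parameters:

ffmpeg -f lavfi -i testsrc=duration=60:size=640x480:rate=30 -pix_fmt yuv420p \
-filter_complex "tvai_up=model=prob-3:scale=2:preblur=-0.6:noise=0:details=1:halo=0.03:blur=1:compression=0:blend=0.8:device=0:vram=1:instances=1" \
-f null -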

We’re still looking into a solution for the actual issue.

@jojje the new beta should fix the issue

I’m noticing probably the exact same issue as jojje, but I’m running the 3.5.1.0.b deb on a fresh Ubuntu 22.04.3 install. This is with 2 ARC A770s; this system has never had Nvidia GPUs installed, so only standard packages exist.

I can’t nail it down as well as he has, but I set my model directory to a /home path so I can see what’s going on. The Topaz UI hunts around for models every single time an instance is started. I’m not sure what the expected behavior is, but the models are created and deleted one by one inside the model directory. It’s basically impossible to run an enhancement together with a frame interpolation. In fact I haven’t gotten a frame interpolation past “downloading” yet.
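For anyone wanting to watch that churn, something like this against the model directory shows the files appearing and disappearing in real time (needs inotify-tools; the /home path below is just a placeholder for wherever you pointed the model directory):

$ inotifywait -m -e create -e delete -e moved_to /home/me/tvai-models   # logs every model file created or removed while an instance starts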

Could you please send me your log files?

I’ve switched the system over to Windows for now. I’ve posted about what I’m seeing there with ARC. Let me mount the Ubuntu drive and see if I can get the logs off of it.

I can’t figure out where Topaz stores logs on Linux; they aren’t in /opt or /var. Can you tell me which directory it is?

Sure, logs are stored per-user in ~/.local/share/Topaz Labs LLC/Topaz Video AI Beta/logs/
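If it’s easier, something like this should bundle them up for attaching here (the path contains spaces, so it needs quoting):

$ tar czf tvai-logs.tar.gz -C "$HOME/.local/share/Topaz Labs LLC/Topaz Video AI Beta" logs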

Well, it apparently does detect the GPU sometimes. The 1 time in 5 that it works, it uses the GPU, as evidenced by both the logs and nvtop.

Yes, of course it does. I wouldn’t be using the nvidia container if it didn’t. That’s what the container is there for; it’s the basis for all our machine-learning-related pipelines…

I’d given up on VAI for Linux and bought a Windows machine for one-off use of VAI. The project that was considering incorporating VAI into a pipeline was terminated due to the lack of VAI Linux support, so I’m still using the old production pipeline, which does not include VAI, just our own models.

I haven’t tried VAI on Linux for a while now. Are there any new releases that fix the issues noted above and would make it worth my time to revisit the platform?

I believe we fixed the issue with the prap-3 model in 3.5.1. The 4.0 betas have also updated some other libraries that may help with running the GPU models on some systems; the latest version is here.

We do make use of the Linux versions internally, however our internal use-cases aren’t containerized so I can’t guarantee it’ll work well with Docker.


Sounds great, Gregory. I’ll test and report back once I find some time.

I’ve now tested 3.5.1 and the latest beta (4.0.7.0.b), and both launch successfully 100% of the time with prob3.

To ensure reproducibility, I tested using the standard tvai docker container, running make benchmark with each version.

Thank you so much for taking the time to nail this thorny bug. It seems the path is now open to explore running VAI on Linux seriously.

I can confirm that 4.0.7.0.b does launch prob3 successfully. I still error out most of the time when attempting iris on the latest version. Not sure what is different between the models.

I also appreciate the time spent making this product work well on Linux. It’s really very powerful and the models have improved a lot in the last 18 months.