“Out of memory” after 4.5 days (108 hours) on 128 GB RAM, no memory pressure: memory leak?
- My 16-day job died at 27% with 12 days remaining. (Don’t worry, I have progress backups.)
4h1m of 1080p50 HEVC to 4K50 HEVC (60Mbps) + Apollo (6x) + Gaia (HQ)
The job started on 2023-07-10 between 01:39 and 02:44, probably 02:40 (UTC+10)
The job crashed on 2023-07-14 between 14:09 and 14:49 (UTC+10)
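To sanity-check the "4.5 days (108 hours)" figure, here is a minimal sketch of the arithmetic. It assumes the "probably 02:40" start time and takes 14:27:23 (the first error timestamp in the log excerpt further down) as the crash time; both are read off this post, not measured independently.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+10 offset, as stated in the post.
UTC_PLUS_10 = timezone(timedelta(hours=10))

# Assumed start: the "probably 02:40" estimate.
start = datetime(2023, 7, 10, 2, 40, tzinfo=UTC_PLUS_10)
# Assumed crash: timestamp of the first error line in the log excerpt.
crash = datetime(2023, 7, 14, 14, 27, 23, tzinfo=UTC_PLUS_10)

elapsed = crash - start
hours = elapsed.total_seconds() / 3600
print(f"elapsed: {elapsed.days} full days, {hours:.1f} hours in total")
```

With those two assumptions the run lasted a little under 108 hours, which matches the title.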
I have been keeping an eye on system resources at all times. No memory pressure.
- System profile available on request (this is a Mac Studio M2 Ultra with plenty of everything)
- New logsForSupport.tar.gz generated: 29.8 MB attached
Application version is v3.3.3: not the latest, but it WAS the latest when I began processing.
- See screenshot from 13:44 (job in progress: 27% and 12d11h1m0s est. remain, 2.9fps),
a screenshot from 14:49 the same day (job cancelled, “Out of memory”),
and logs from 14:54 while the Topaz app was sitting idle.
I am going to update to v3.3.4 now, and restart the job near the point of failure.
If I don’t report back today, it means the same point in the source footage was fine.
My ‘temp folder’ is also on the system volume, which I’m changing to my scratch disk.
I already have the ‘export folder’ set to a large scratch disk with 300+ GB free space.
Both volumes, however, had 300+ GB free space at all times. I think the error is RAM, not disk.
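For anyone who wants numbers rather than eyeballing, here is a rough sketch of how the RAM-versus-disk question could be logged during a long run. The helper names are made up for illustration, and the `"."` path is a placeholder for the real temp and export folders. It relies on `ps -o rss= -p PID`, which reports resident memory in KiB on macOS and Linux. Resident memory may not reflect GPU-side allocations, so treat a flat reading as a hint, not proof.

```python
import os
import shutil
import subprocess

def free_gb(path):
    """Free space on the volume that contains `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9

def rss_mb(pid):
    """Resident memory of `pid` in MiB as reported by `ps`, or None if unavailable."""
    try:
        out = subprocess.run(
            ["ps", "-o", "rss=", "-p", str(pid)],
            capture_output=True, text=True, check=True,
        ).stdout
        return int(out.strip()) / 1024  # ps reports RSS in KiB
    except (OSError, subprocess.CalledProcessError, ValueError):
        return None

def sample(pid, paths):
    """One snapshot: process memory plus free space on each volume of interest."""
    return {"rss_mb": rss_mb(pid), "free_gb": {p: free_gb(p) for p in paths}}

if __name__ == "__main__":
    # Placeholder inputs: this script's own PID and the current directory.
    # In practice, pass the processing app's PID and the temp/export folders.
    print(sample(os.getpid(), ["."]))
```

Calling `sample` every few minutes and appending the result to a file would show whether memory climbs steadily over days (which would suggest a leak) or free disk space drops before the failure.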
logsForSupport.tar.gz (28.4 MB)
2023-07-14 14-27-22.583 Thread: 0x1ddfde440 Info OUT: 10 frame= 1208713 fps= 2.95777
2023-07-14 14-27-23.605 Thread: 0x1ddfde440 Info OUT: 10 Error: command buffer exited with error status.
The Metal Performance Shaders operations encoded on it may not have completed.
Error:
(null)
Internal Error (0000010a:Internal Error)
<AGXG14XFamilyCommandBuffer: 0x2ca806450>
label = <none>
device = <AGXG14DDevice: 0x12b80f400>
name = Apple M2 Ultra
commandQueue = <AGXG14XFamilyCommandQueue: 0x44db86800>
label = <none>
device = <AGXG14DDevice: 0x12b80f400>
name = Apple M2 Ultra
retainedReferences = 1
2023-07-14 14:27:23 0x44cffb000 CRITICAL: CMBackend output not found
2023-07-14 14:27:23 0x44cffb000 CRITICAL: Unable to run model with index 0 it had error:
2023-07-14 14-27-23.633 Thread: 0x1ddfde440 Info OUT: 10
2023-07-14 14:27:23 0x44cffb000 CRITICAL: Caught exception in tile processing thread and stopped it msg: std::exception1
2023-07-14 14-27-23.633 Thread: 0x1ddfde440 Info OUT: 10 libc++abi: terminating due to uncaught exception of type std::__1::system_error: thread::join failed: Resource deadlock avoided
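To find where to restart, the last progress line before the first error is the useful one. Below is a minimal sketch that pulls the timestamp, frame counter and fps out of lines shaped like the excerpt above. The regular expression is inferred only from these few lines, so other log lines may not match, and whether `frame=` counts source or output frames is not something the log excerpt confirms.

```python
import re
from datetime import datetime

# Pattern inferred from the excerpt: "<date> <HH-MM-SS.mmm> ... frame= N fps= X"
PROGRESS = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}-\d{2}-\d{2}\.\d{3}) .*?"
    r"frame=\s*(\d+)\s+fps=\s*([\d.]+)"
)

def parse_progress(line):
    """Return (timestamp, frame, fps) for a progress line, else None."""
    m = PROGRESS.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group(1), "%Y-%m-%d %H-%M-%S.%f")
    return ts, int(m.group(2)), float(m.group(3))

def last_progress(lines):
    """Last progress entry seen in an iterable of log lines, or None."""
    last = None
    for line in lines:
        parsed = parse_progress(line)
        if parsed is not None:
            last = parsed
    return last

if __name__ == "__main__":
    excerpt = [
        "2023-07-14 14-27-22.583 Thread: 0x1ddfde440 Info OUT: 10 frame= 1208713 fps= 2.95777",
        "2023-07-14 14-27-23.605 Thread: 0x1ddfde440 Info OUT: 10 Error: command buffer exited with error status.",
    ]
    print(last_progress(excerpt))
```

On the excerpt this picks out frame 1208713 at 14:27:22, about one second before the command buffer error.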