M2 Potential Memory Leak

“Out of memory” after 4.5 days (108 hours) on 128 GB of RAM, with no memory pressure: memory leak?

  1. My 16-day job died at 27% complete, with 12 days remaining. (Don’t worry, I have progress backups.)
    4h1m of 1080p50 HEVC to 4K50 HEVC (60 Mbps) + Apollo (6x) + Gaia (HQ)
    The job started on 2023-07-10 between 01:39 and 02:44, probably 02:40 (UTC+10).
    The job crashed on 2023-07-14 between 14:09 and 14:49 (UTC+10).
    I have been keeping an eye on system resources at all times. No memory pressure.
  2. System profile available on request — this is a Mac Studio M2 Ultra with plenty of everything.
  3. New logsForSupport.tar.gz generated: 29.8 MB attached.
    Application version is v3.3.3 — not the latest, but it WAS the latest when I began processing :sweat_smile:
  4. See the screenshot from 13:44 (job in progress: 27%, 12d11h1m0s estimated remaining, 2.9 fps),
    the screenshot from 14:49 the same day (job cancelled, “Out of memory”),
    and the logs from 14:54, while the Topaz app was sitting idle.
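For anyone wanting to double-check the “no memory pressure” observation on their own long runs, here is a rough sketch (my own illustration, nothing Topaz ships) that samples a process’s resident set size via `ps`, which works on macOS and Linux:

```python
import subprocess
import time


def rss_kib(pid: int) -> int:
    """Return the resident set size of `pid` in KiB using `ps` (macOS/Linux)."""
    out = subprocess.run(
        ["ps", "-o", "rss=", "-p", str(pid)],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())


def watch(pid: int, interval_s: float = 60.0, samples: int = 3) -> list[int]:
    """Sample a process's RSS a few times.

    A leak shows up as steady growth across samples that never
    comes back down, even when the job's workload is uniform.
    """
    readings = []
    for _ in range(samples):
        readings.append(rss_kib(pid))
        time.sleep(interval_s)
    return readings
```

Pointing `watch()` at the Topaz process’s PID (from Activity Monitor or `pgrep`) every few minutes over a multi-day job would show whether RSS climbs monotonically.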

I am going to update to v3.3.4 now, and restart the job near the point of failure.
If I don’t report back today, it means the same point in the source footage processed fine.

My ‘temp folder’ is also on the system volume, which I’m changing to my scratch disk.
I already have the ‘export folder’ set to a large scratch disk with 300+ GB free space.
Both volumes, however, had 300+ GB free at all times, so I think the error is RAM, not disk.

logsForSupport.tar.gz (28.4 MB)

2023-07-14 14-27-22.583 Thread: 0x1ddfde440 Info OUT: 10  frame=  1208713  fps=  2.95777

2023-07-14 14-27-23.605 Thread: 0x1ddfde440 Info OUT: 10 Error: command buffer exited with error status.
	The Metal Performance Shaders operations encoded on it may not have completed.
	Internal Error (0000010a:Internal Error)
	<AGXG14XFamilyCommandBuffer: 0x2ca806450>
    label = <none> 
    device = <AGXG14DDevice: 0x12b80f400>
        name = Apple M2 Ultra 
    commandQueue = <AGXG14XFamilyCommandQueue: 0x44db86800>
        label = <none> 
        device = <AGXG14DDevice: 0x12b80f400>
            name = Apple M2 Ultra 
    retainedReferences = 1

2023-07-14 14:27:23 0x44cffb000  CRITICAL: CMBackend output not found
2023-07-14 14:27:23 0x44cffb000  CRITICAL:  Unable to run model with index  0  it had error:  
2023-07-14 14-27-23.633 Thread: 0x1ddfde440 Info OUT: 10 
2023-07-14 14:27:23 0x44cffb000  CRITICAL:  Caught exception in tile processing thread and stopped it msg: std::exception1
2023-07-14 14-27-23.633 Thread: 0x1ddfde440 Info OUT: 10 libc++abi: terminating due to uncaught exception of type std::__1::system_error: thread::join failed: Resource deadlock avoided
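The final log line is libc++ terminating on an unhandled `std::system_error` from `thread::join`, the error you get when a thread effectively tries to join itself (or two threads try to join each other). Just to illustrate the failure mode, and not as a claim about what the app’s code actually does, here is the analogous mistake in Python:

```python
import threading


def self_join_error() -> str:
    """Have a worker thread call join() on itself and capture the error text."""
    result = {}

    def worker():
        try:
            t.join()  # joining the current thread is a guaranteed deadlock
        except RuntimeError as exc:
            result["msg"] = str(exc)

    t = threading.Thread(target=worker)
    t.start()
    t.join()  # joining from the main thread is fine
    return result["msg"]
```

CPython refuses with a `RuntimeError`, just as libc++ refuses with “Resource deadlock avoided” rather than hanging forever.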

Was the last process successful?

I’ve had multiple successes since then, of a similar magnitude. My jobs are easily longer than the mean time between updates, haha! So I’ve updated the app after roughly 90% of the jobs I finish, LOL

tl;dr yes I think you’ve fixed it, or it was rare enough.

Do the recent versions decrease CPU/GPU usage? Or memory usage?
I think I capped the memory usage too aggressively.

Previously I’d have one job on the go, but with 4 (the max?) I can push the system beyond 100 GB of the 128 GB in use. (Yes, I am mostly aware of the incredible complexities behind the question "how much RAM are you using?")


They should not; however, unintended issues can arise between builds.

That is a lot of RAM! I would leave the max processes at 2, which should be the recommended configuration for your machine. Is that what’s displayed on your end?

This was weeks ago, but I vaguely remember it was set to 2, and I thought to myself: “but that can’t be me; if I’m on ‘2’, then who’s on ‘4’? Surely that’s wrong.” But I love being set straight, so please describe the kind of setup that needs ‘4’. Or is it just for when you queue up 20 different test sequences of 1 min each?

I have this set to 4 (albeit with “only” 64 GB), and yes, this is mainly so I can run several shorter video clips (mainly music videos) at once and still be able to, e.g., check settings for a third clip with a preview while two are exporting.


I do think this issue may still be in play for some users, as we are still seeing similar reports in v3.3.8/v3.3.9.

Have you noticed this again if you are on a newer build?

(If you’re asking me?) I haven’t seen this since, but I’ll add notes here if I ever do.
