I’ve been troubleshooting a test case that behaves differently on EL9 than it does on EL7.
It’s specifically:
- On an NFS mount.
- Occurs on our EL9 systems (5.14.0-362.18.1.el9_3.x86_64). Does not occur on any of our older CentOS 7 systems (various versions of the 3.10 kernel).
- In a subdirectory. (Doesn’t happen in a ‘top level’ mount).
- The code frequently ‘breaks’ with os.getcwd() returning ‘no such file or directory’.
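When the error above shows up, it can help to record what the kernel says about the working directory at that moment. The following is only a sketch: probe_cwd is a hypothetical helper name, not part of the failing test. It calls os.getcwd() and os.stat(".") separately and records the errno name for whichever one fails. The Linux getcwd(2) man page documents ENOENT as "the current working directory has been unlinked", so a failing getcwd alongside a working stat would suggest the directory itself is still reachable.

```python
import errno
import os


def probe_cwd():
    """Report what os.getcwd() and os.stat(".") return for the current directory.

    Hypothetical diagnostic helper. Each entry is ("ok", value) on success
    or ("error", errno_name) if the call raised OSError.
    """
    report = {}
    for name, call in (("getcwd", os.getcwd), ("stat", lambda: os.stat("."))):
        try:
            report[name] = ("ok", call())
        except OSError as exc:
            # Map the numeric errno to its symbolic name, e.g. ENOENT or ESTALE.
            report[name] = ("error", errno.errorcode.get(exc.errno, str(exc.errno)))
    return report
```

Calling probe_cwd() from the except branch of the reproduction loop would show whether only the path lookup fails or whether the directory handle itself has gone bad.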
Example reproduction code:
    #!/usr/bin/python3 -u
    import os
    import multiprocessing
    import time

    if __name__=="__main__":
        multiprocessing.set_start_method("spawn")
        count = 0
        while True:
            try:
                os.getcwd()
                pool = multiprocessing.Pool(10)
                pool.close()
                pool.terminate()
                count += 1
            except Exception as e:
                print(f"Failed after {count} iterations")
                print(e)
                break
I’m at something of a loss to understand quite what’s going on here, and quite why it fails.
It seems to be connected to pool.terminate(): if you add even a short (0.05 second) sleep before that call, the problem stops occurring (or at least it wasn’t reproducible in a ‘sensible’ amount of time, whereas the code above fails in <10 iterations).
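For comparison, here is a sketch of the variant with the sleep before pool.terminate(). It is a bounded rewrite of the loop above, not the exact code from our tests: the run_iterations name and its parameters are made up for illustration, and it uses multiprocessing.get_context("spawn") rather than set_start_method so it can be called more than once in the same process.

```python
import multiprocessing
import os
import time


def run_iterations(iterations, delay, workers=10):
    """Create and tear down a spawn-based pool up to `iterations` times.

    `delay` is the pause in seconds inserted before pool.terminate().
    Returns how many iterations completed before an exception was raised.
    """
    ctx = multiprocessing.get_context("spawn")
    count = 0
    for _ in range(iterations):
        try:
            os.getcwd()
            pool = ctx.Pool(workers)
            pool.close()
            time.sleep(delay)  # the pause that appears to hide the problem
            pool.terminate()
            count += 1
        except Exception as exc:
            print(f"Failed after {count} iterations: {exc}")
            break
    return count

# Example use, from under an `if __name__ == "__main__":` guard:
#     run_iterations(100, 0.05)   # with the sleep
#     run_iterations(100, 0.0)    # without it
```

Running it once with delay=0.0 and once with delay=0.05 from the same NFS subdirectory gives a direct comparison of how many iterations each survives.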
We picked this up because some of our tests fail on the ‘new version’ with a simpler case:
    import multiprocessing
    multiprocessing.set_start_method("spawn")
    while True:
        multiprocessing.Pool(10)
But as mentioned, it is specific to NFS mounts in subdirectories, which we think might be because the struct dentry behaves a bit differently on root mounts.
I was wondering if anyone could shed some light on what might be going wrong here? I’m at a loss to identify whether it might be the fileserver, the Linux kernel, something within Python, or… well, somewhere else entirely.