This may well be an edge case that can’t be designed around, but I simply don’t understand the interaction between all the pieces well enough to figure it out myself.
On my current processing rig (HP EliteBook 2740p [i5-540m+8GB RAM/24GB SWAP+128GB SSD+Solus 4.1]), it does not take much to push *ODM to the point where physical memory is fully consumed and swap comes heavily into play.
At this point, it seems that WebODM loses its connection with NodeODM and throws the “Processing Node has gone offline” error message. That message seems to be incorrect, as the processing continues unabated in the background long after it appears.
Given that the processing (at least in multiple cases here) has continued all the way to the point where I can extract the final “all.zip” from the Docker container with all products intact (!), is that message not somewhat misleading?
For example, the latest screenshots of O’Brien’s Field processing came from this job, which WebODM doesn’t recognize as completed.
That makes total sense, but I’m still curious about the how and why of getting WebODM to recognize that the job is still running. I imagine that in the setup you proposed, a bad network connection could also cause a similar drop/disconnect. In that situation, wouldn’t a means of polling NodeODM to re-establish job state/progress make sense?
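To make the polling idea concrete, here is a rough sketch of what "ask the node again before giving up" could look like. This is not how WebODM actually does it; the `/task/<uuid>/info` endpoint and the numeric status codes reflect my understanding of the NodeODM API and should be checked against your version. The function names, retry count, and backoff delays are all made up for illustration.

```python
import json
import time
import urllib.request
from typing import Callable, Optional

# Assumption: status codes as I understand NodeODM's task info response.
STATUS_NAMES = {10: "QUEUED", 20: "RUNNING", 30: "FAILED",
                40: "COMPLETED", 50: "CANCELED"}


def fetch_task_info(base_url: str, uuid: str, timeout: float = 10.0) -> dict:
    """GET <base_url>/task/<uuid>/info and decode the JSON body."""
    url = f"{base_url}/task/{uuid}/info"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def poll_task_state(fetch: Callable[[], dict],
                    attempts: int = 5,
                    delay: float = 30.0,
                    sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Return the task's status name, or None if every attempt failed.

    A failed attempt (timeout, refused connection, malformed reply) is
    treated as "node unreachable right now", not as "task is dead".
    """
    for attempt in range(attempts):
        try:
            code = fetch()["status"]["code"]
            return STATUS_NAMES.get(code, f"UNKNOWN({code})")
        except (OSError, ValueError, KeyError, TypeError):
            if attempt < attempts - 1:
                # Exponential backoff: delay, 2*delay, 4*delay, ...
                sleep(delay * (2 ** attempt))
    return None
```

Usage would be something like `poll_task_state(lambda: fetch_task_info("http://localhost:3000", "<task uuid>"))`, assuming the node listens on port 3000. The point of the design is that only a `None` after all retries would justify an "offline" message; a slow node that eventually answers `RUNNING` would not.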
Similarly, having a folder within the Media-DIR with a completed all.zip should tell WebODM that NodeODM finished the job, even if the two weren’t talking the whole time, right?
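As a sketch of that check: something along these lines could decide whether a task folder holds a usable archive. The folder layout and the function name are hypothetical, and I am not claiming WebODM has a hook for this; it only shows that "is there a complete all.zip?" is cheap to answer.

```python
import zipfile
from pathlib import Path


def archive_looks_complete(task_dir, name: str = "all.zip") -> bool:
    """True if <task_dir>/<name> exists and passes a zip integrity check."""
    archive = Path(task_dir) / name
    # A zip that is still being written typically has no central directory
    # yet, so is_zipfile() rejects it.
    if not archive.is_file() or not zipfile.is_zipfile(archive):
        return False
    try:
        with zipfile.ZipFile(archive) as zf:
            # testzip() returns the first member with a bad CRC, or None.
            return zf.testzip() is None
    except (zipfile.BadZipFile, OSError):
        return False
```

Note that `testzip()` reads every member, so on a multi-gigabyte archive and a slow disk this takes a while. It also only tells you the archive is intact, not that every expected product is inside it.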
We can write all the checks and code we want, but if the OS runs out of memory and starts shutting things down at random, that code could be part of it! Then you need to have code that checks the code that checks the code that checks… you get the idea.
Not to say that there aren’t things we could do to improve the situation, but the correct thing to do is to have enough memory.
As I suspected, I need a Sweet Potato instead of a regular potato…
Hmm, if I can demonstrate that the disconnect occurs due to a CPU lock (all threads at 100%) without a corresponding OOM situation (physical memory and swap both still available), would that change your assessment? What would that mean is occurring? A missing handshake between the two?
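One way to gather that evidence on Linux is to log load and memory side by side while the job runs, then compare timestamps with the moment the "offline" message appears. A minimal sketch follows; the 256 MB threshold and the category names are arbitrary choices of mine, not anything *ODM defines.

```python
import os


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo text into {field: value}; sizes are in kB."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def classify(load1: float, cpus: int, meminfo: dict,
             low_kb: int = 256 * 1024) -> str:
    """Label a sample as memory-exhausted, cpu-saturated, or ok."""
    mem_low = meminfo.get("MemAvailable", 0) < low_kb
    swap_low = meminfo.get("SwapFree", 0) < low_kb
    # Memory is checked first: heavy swapping also inflates the load
    # average, so a high load alone does not prove a pure CPU problem.
    if mem_low and swap_low:
        return "memory-exhausted"
    if load1 >= cpus:
        return "cpu-saturated"
    return "ok"


def sample() -> str:
    """Take one reading from the live system (Linux only)."""
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    return classify(os.getloadavg()[0], os.cpu_count() or 1, info)
```

Calling `sample()` every few seconds and appending the result with a timestamp to a file would show whether the disconnect lines up with `cpu-saturated` readings while memory and swap are still available.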