This is really two questions in one post. I recently got ClusterODM working across two local nodes. While testing split-merge across them, the non-local node (the one running neither ClusterODM nor the controlling WebODM instance) appeared to get the bulk of the work, even though it is the weaker machine. I thought this was because the stronger machine was node #2, but I swapped them in ClusterODM and I don’t think that fixed it (I’m still testing to be sure). This brings up my first question:
How does ClusterODM prioritize nodes for processing? Is it strictly the node number in ClusterODM, the available queue, and the node’s --max_images value? Or is there more to it?
Second, while going through each program’s parameters, I saw that WebODM has the split-merge image count (the split option) and NodeODM has its own --max_images parameter. How do these two parameters interact?
If I create a project with 1000 images and set split to 500 images, but the node’s --max_images is 400, what will happen?
If a 500-image project is sent to ClusterODM with Node #1 at --max_images 400 and Node #2 at --max_images 450, will the project start on Node #1 or Node #2?
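To make the two scenarios above concrete, here is a minimal Python sketch of one naive reading of the parameters: split divides the dataset into roughly equal submodels (overlap ignored), and a node is only eligible for a job whose image count is at or below its max_images. Both rules are assumptions for illustration, not ClusterODM's confirmed scheduling logic, and the helper names and node labels are hypothetical. The 451-image test reported later in this thread suggests the limit may not be enforced this strictly in practice.

```python
import math

def estimate_submodels(total_images, split):
    """Rough submodel count if each submodel holds about `split` images.

    Assumption: overlap between submodels is ignored.
    """
    if total_images <= 0 or split <= 0:
        raise ValueError("total_images and split must be positive")
    return math.ceil(total_images / split)

def nodes_within_limit(num_images, nodes):
    """Names of nodes whose max_images is unset (None) or >= num_images.

    Assumption: max_images acts as a hard per-task cap on image count.
    """
    return [name for name, max_images in nodes
            if max_images is None or num_images <= max_images]

# Hypothetical node labels with the limits from the questions above.
nodes = [("node-1", 400), ("node-2", 450)]

print(estimate_submodels(1000, 500))   # 2
print(nodes_within_limit(500, nodes))  # []
print(nodes_within_limit(420, nodes))  # ['node-2']
```

Under these assumptions, neither node would accept a 500-image job, so a split value at or below the smallest max_images in the cluster would be the cautious choice.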
Are there any other things to be aware of when setting up distributed processing across multiple nodes with these parameters? I’m running a test project or two at the moment to experiment, but would love to hear from someone with more expertise than myself who can help clarify the inner workings.
As a test, I ran a set of 451 images on a node with max_images set to 450, and it ran all the way through. I then set the node to a maximum of 50 images and am now running a 663-image set. Five minutes in, it has loaded onto the node and is finding points. I don’t believe I’ve seen an error yet.
What was your performance like on the larger set? How many node instances did you have running, and of what type? I’ve just finished a 661-image set with 4 processing nodes, a ClusterODM machine, and a WebODM instance. All are dedicated instances. Of the 4 processing nodes, 2 were r5a.12xlarge (48 vCPU, 393G mem) and 2 were r5a.4xlarge (16 vCPU, 131G mem). The machines are running Amazon Linux 2. My split was 200 images, and I selected pc-classify, dsm, and dtm. The run took 2:31. With only the 2 smaller nodes and a split of 400, the same set ran in 3:20. On the larger run, I noticed ClusterODM logging a warning for one of the smaller nodes, “warn: Cannot update info for …connect ECONNREFUSED…” on port 3000 (5 times in total), and the utilization on that node dropped to zero. Perhaps a database connection? I checked the connections with telnet and all nodes were still reachable on port 3000.
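For anyone chasing a similar ECONNREFUSED warning, a small Python sketch like the following can stand in for the manual telnet check by testing whether each node accepts TCP connections on port 3000. The host list is a placeholder, and the function name is hypothetical. A successful connection only shows that the port is accepting connections at that moment, not that the node service is healthy.

```python
import socket

def port_is_open(host, port=3000, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False

if __name__ == "__main__":
    # Placeholder address; replace with your own node hosts.
    for host in ["127.0.0.1"]:
        state = "open" if port_is_open(host) else "refused or unreachable"
        print(f"{host}:3000 {state}")
```

Running this in a loop while a task is processing may help show whether the refusals are brief (for example, during a service restart) or persistent.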