Multi-Node Cluster

Does anyone have a working demonstration of ClusterODM with auto-scaling launching multiple nodes and processing sub-models created using split/merge? Does anyone have an example of that running against AWS with Ubuntu AMIs?

1 Like

Unfortunately, I’ve only done that on Digital Ocean. I think Piero uses Scaleway. But I’m super curious myself: I know there have been folks to deploy on AWS.

1 Like

Well, this is such a key piece of tech to get working for big production lines, it’s hard to let it go! What’s perplexing is that for static collections of nodes, the submodel decomposition and distribution among nodes appear to work as expected. In fact, I’ve dropped 7 hr runs to 2 hrs on 500+ image sets. But when autoscaling is invoked via the --asr flag, it’s almost as if the split/merge settings are ignored. It adds a node, but processing proceeds as if there were no submodels.

If you (or anyone) can point me to the ClusterODM code that does the task decomposition according to submodels, that might provide a lead on what isn’t happening in the AWS environment. Thanks!

2 Likes

I can’t help much there, but the first thing to check when plugging this into a cloud provider is what the ASR configuration on the host is actually doing, e.g.:

telnet localhost 8080
Trying ::1...
Connected to localhost.localdomain.
Escape character is '^]'.
Welcome ::1:55956 ClusterODM:1.5.2
HELP for help
QUIT to quit
#> help
NODE ADD <hostname> <port> [token] - Add new node
NODE DEL <node number> - Remove a node
NODE INFO <node number> - View node info
NODE LIST - List nodes
NODE LOCK <node number> - Stop forwarding tasks to this node
NODE UNLOCK <node number> - Resume forwarding tasks to this node
NODE UPDATE - Update all nodes info
NODE BEST <number of images> - Show best node for the number of images
ROUTE INFO <taskId> - Find route information for task
ROUTE LIST [node number] - List routes
TASK LIST [node number] - List tasks
TASK INFO <taskId> - View task info
TASK OUTPUT <taskId> [lines] - View task output
TASK CANCEL <taskId> - Cancel task
TASK REMOVE <taskId> - Remove task
ASR VIEWCMD <number of images> - View command used to create a machine
!! - Repeat last command
#> ASR VIEWCMD 500
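That last command prints the docker-machine invocation ClusterODM would use to provision a node for that image count. I’ve only watched this on DigitalOcean, but the shape is the same everywhere; on AWS the output should be roughly along these lines (every flag and value comes straight from your ASR JSON, so treat this purely as an illustration, with credentials trimmed):

docker-machine create --driver amazonec2 \
  --amazonec2-region us-east-1 \
  --amazonec2-ami ami-xxxxxxxx \
  --amazonec2-instance-type m5.2xlarge \
  --amazonec2-root-size 160 \
  --amazonec2-request-spot-instance \
  <machine-name>

If that command doesn’t reflect what you expect (instance type, root size, spot or not), the problem is in the ASR JSON rather than in split/merge.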
1 Like

Hi,
I would be highly interested in this. I looked at the config file, but I’m not sure what to put in the different values or how to configure it for my Digital Ocean account.
Could you help me with that? Perhaps in another thread?

1 Like

Yep - @pierotofy tipped me off to that "asr viewcmd" earlier. The construction of the parameters for the docker-machine command works as advertised. However, the number of images is increased by 100 in my case, from 540 to 640, meaning that it’s looking for a machine that can handle that load. I’ve tested this and it does a great job of using that to provision the machine identified by the "imageSizeMapping" field for (spot) instances. The problem is that it’s not considering the split when doing this. So, if my split is 100, for 540 images I need 6 machines rated to handle 100 images each (6 submodels), not one machine rated at > 640 images.

Which is pretty much why I’m wondering what’s broken about my config/environment that is ignoring the split.
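For reference, the "imageSizeMapping" I’m talking about is the list of thresholds in the ASR JSON, roughly this shape (the thresholds, instance types and storage sizes below are placeholders from my setup, not recommendations):

"imageSizeMapping": [
    { "maxImages": 100, "slug": "m5.large", "storage": 160 },
    { "maxImages": 300, "slug": "m5.xlarge", "storage": 160 },
    { "maxImages": 700, "slug": "m5.2xlarge", "storage": 160 }
]

With the 540-image task padded to 640, ClusterODM picks the single smallest entry whose maxImages covers 640, rather than six entries sized for ~100 images each.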

2 Likes

Have you explicitly set the ClusterODM host using sm-cluster?

1 Like

No - where is that done? Here’s my process:

  1. Launch WebODM on primary
  2. Launch reference node on primary
  3. Launch ClusterODM on primary
  4. Add #2 (using the primary IP/port) to the node list via telnet to 8080, then lock it (see the sketch below).
  5. Using the WebODM UI, add a cluster processing node using the primary IP/3000.
  6. Make sure the cluster is used as the processing “node” when the task is run.
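Step 4, concretely, is a short telnet session on the admin port (the reference node’s port and the node number below are just what my setup happens to use):

telnet localhost 8080
#> NODE ADD <primary-ip> 3001
#> NODE LIST
#> NODE LOCK 1
#> QUIT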

Nowhere do I explicitly set sm-cluster. However, in the log file there is an entry for name: sm-cluster, with the value "URL of my primary machine (http://)".

This is set in taskNew.js according to the public address I pass in the ClusterODM launch:

node index.js --public-address "http://x.x.x.x:80" --asr "…/aws.json"

1 Like

It’s an option in WebODM (also ODM): see “Options and Flags” in the OpenDroneMap 2.6.7 documentation.

1 Like

O.K. - I see the discussion here - Cluster not being used · Issue #1013 · OpenDroneMap/WebODM · GitHub

And --public-address is used for that and apparently sets it in my case. I’ll try setting it explicitly in the ClusterODM params.

I’ve now used both --sm-cluster and --public-address, separately and together, with both public and private IPs. I also varied the port used on --public-address. No change in behavior - it still fires up a single node sized for the entire image set.

1 Like

A node that can handle N images (the total) will always be spun up; this is also referred to as the “master node”. It coordinates and merges the submodel reconstructions.

WebODM → ClusterODM → Master Node (N)

If --split is set, the master node will also automatically receive a --sm-cluster parameter: ClusterODM/taskNew.js at c2c2e8f97b9518055f97e822cd5e5874dbc9be52 · OpenDroneMap/ClusterODM · GitHub

So the task on the master node, if you select --split in WebODM, will receive --split [value] --sm-cluster [public-address]. You must select split (if you haven’t, the dataset will not split and sm-cluster will not be set).
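Concretely, with made-up example values, the master task’s options end up including something like:

--split 100 --sm-cluster http://203.0.113.10:3000

where the address/port is whatever you passed to ClusterODM as --public-address.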

The master node, if both split and sm-cluster are set, will send submodels for processing to [public-address]. These secondary nodes each handle far fewer images (roughly the --split value per submodel, so about N / split submodels in total).

If the secondary nodes are not spinning up, the master node is probably not sending the tasks correctly; check the log of the master node, make sure sm-cluster is set (from the log) and look for error messages there.

The master node MUST be able to reach [public-address]. If not, you have a firewall/network issue.

sm-cluster should show the value of the public IP address of ClusterODM (not of the master node). If you switched it by accident, the master node will simply send tasks to itself (and probably stall forever depending on the queue count limits).
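A quick way to verify that from the master node’s side is to query ClusterODM’s proxy from that machine (substitute whatever you set as --public-address):

curl http://203.0.113.10:3000/info

If that doesn’t come back with the node info JSON, submodel tasks can’t be forwarded either.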

1 Like

O.K. - thanks for that clarification @pierotofy! I created a config-default.json for my ClusterODM launch and can consistently get tasks created with sm-cluster and split set. Once the master node is launched, I can see WebODM task output indicating it has recognized a split-merge opportunity, and my ClusterODM debug output starts spewing attempts to launch worker nodes in the cluster.

I’m getting some errors along the way including missing instance ID and key pair name, but the main error appears to be a mismatch between the amazonec2-root-size (160) and the AMI root size (300).

Thanks again for the details!

3 Likes

It’s now working and spinning up worker nodes in AWS! I just reset the minimum storage specified in the image mapping of the AWS instance config to be >= the storage amount used for the master node. I’ll revisit whether there is a setting on the AMI that allows creating block storage smaller than the original specification.

Spot instances are, well, a little spotty. So, for those willing to pay for availability, set spot to ‘false’ in the config.
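For anyone following along, the two things I ended up changing in my ASR JSON were roughly these (values are from my setup, not recommendations):

"spot": false,
"imageSizeMapping": [ { "maxImages": 100, "slug": "m5.large", "storage": 300 }, ... ]

with storage at least as large as the root volume of the AMI used for the master node.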

I think at this point I can contribute to some of the docs around this, perhaps to make them a little more explicit about settings, so I’ll get on that. Additionally, I think there is a very good TCO analysis that can be done now. DO is the price/performance leader, so results comparing and contrasting these cloud providers can be backed up with a cost-benefit analysis.

Also, these are Ubuntu instances, and there is still the matter of getting the Docker install script, 19.03.9.sh, maintained by Rancher, updated to support Amazon Linux 2. I’ve posted to that community about what I want to do without a response, so I will have to revisit it with an issue on their GitHub site and see where that goes.

2 Likes

Fantastic stuff, Karl! Thanks so much.

I’m happy to report it cut processing time in half and wrote the results to S3. Along the way, with a dedicated (not spot) set of nodes, it actually lost one node but continued successfully to the end.

Awesome stuff, ODM devs! @pierotofy @smathermather

2 Likes

Nice! Have you had to make modifications to the aws.js driver on ClusterODM? If so, could you open a pull request on GitHub? :pray:

2 Likes

PR coming with mods to /libs/asr-providers/aws.js and /docs/aws.md.

6 Likes