Autoscaling by Submodel

Wanted to float an idea to the community: provisioning docker-machines based on submodels rather than the whole model. Currently, node provisioning appears to be driven by the total number of images, which leads to a single, large node sized for the entire image set. A cheaper (and perhaps faster) approach might be to provision multiple nodes based on the number of submodels. This could be implemented as a new index.js parameter indicating that multi-node provisioning should occur, based either on submodels or on any number/type of machines the user desires. No doubt there are quality impacts, which might be mitigated by modifying other parameters (e.g. features), and that mitigation could negate any performance gain.

Please educate me on this since no doubt it’s been discussed.

Would the “size” of the submodels (image count) also play into how you’d like it to scale/provision?

That’s the idea. Divide the total image count by the per-submodel image count and round up to get the total number of machines. That, or specify the number/type of machines desired in the cluster. Again, as @pierotofy has suggested, there are likely quality tradeoffs that might only be mitigated by more processing, e.g. increasing features.
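To put numbers on it, the machine-count arithmetic is just a ceiling division. A minimal sketch in Node (my own illustration, not ClusterODM code; the function name is made up):

```javascript
// Hypothetical helper: how many worker machines a submodel-based
// autoscaler would provision. Illustration only, not ClusterODM code.
function machinesNeeded(totalImages, imagesPerSubmodel) {
  if (imagesPerSubmodel <= 0) throw new Error("split must be positive");
  return Math.ceil(totalImages / imagesPerSubmodel);
}

// 540 images with a split of 100 -> 6 machines
console.log(machinesNeeded(540, 100)); // 6
```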

A couple of other things if there are eyes on this thread:

  1. I’m getting better consistency out of the docker-machine call when using the --native-ssh flag. Before, like many others reporting in posts here, it was hanging on the ssh call.
  2. The swap number defaults to 1, which for this day and age is probably far too much swap; more or less, you want to keep everything in memory. So, in my aws.json file I plan to reduce swap to 1/4. Would be curious to hear reactions and thoughts.
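For reference, here’s roughly what I mean for the swap change, as a fragment of my aws.json (hedged: I’m only showing the one key, and you should double-check the exact key name against the sample configuration shipped with your ClusterODM version):

```json
{
  "swap": 0.25
}
```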

I guess the main question is - what am I missing? Are others generating multi-node clusters with auto-scaling?

I need help getting auto-scaling going. I reread the docs this morning; indeed, auto-scaling should work by splitting the image set according to the split value and spawning the corresponding number of node machines. I’m working in AWS with an Ubuntu AMI and 540 images. With a locked reference node, split @ 100, overlap @ 150 or 20, I get a single node spawned. Nothing out of the ordinary in the debug output except the final warning, which is quickly followed by the node being added -

    warn: Cannot update info for 107.23.137.180:3000: connect ECONNREFUSED 107.23.137.180:3000
    info: Waiting for 107.23.137.180:3000 to get online… (1)
    debug: Added node: 107.23.137.180:3000

And the node is successfully added to the cluster.

    #> node list
    1) x.x.x.x:3001 [online] [0/1] <engine: odm 2.6.7> <API: 2.2.0> [L]
    2) 107.23.137.180:3000 [online] [0/1] <engine: odm 2.6.7> <API: 2.2.0> [A]

Unfortunately, WebODM promptly fails with the “Cannot Process Dataset” error. If I then restart by hand, processing starts and proceeds.

So, the two things outstanding are that: 1) I only get a single node, and 2) WebODM fails once the node is added to the cluster.

I’ve read numerous posts from @smathermather and @pierotofy that demonstrate multi-node clusters working, along with others reporting multi-node auto-scaling. Any help would be much appreciated.

I’ve got nothing other than that debug message smacks of auth/credential issues. I don’t see how or why, since you’ve already done a lot of work to iron those out…

Interestingly, I am able to run the auto-scaler without specifying a split. That said, it only generates a single node but does proceed directly to processing without error.

Consequently, I’m wondering if my troubles have to do with the dataset. The set can run on a single node, but not across multiple nodes?

I don’t think I’ve ever come across a dataset that has some intrinsic thing that makes it not able to scale…

Kicked off an autoscaling job this morning that launched a single node, installed docker, attached the node, and ran to completion. Banner day, whoohoo. The culprit appeared to be the ‘splitmerge’ setting in the config-default.json file in the ./ClusterODM root. I checked master and it has it set to ‘true’; the default is ‘false’, and the docs say it should be ‘false’. The setting is also confusingly named: it appears as ‘splitmerge’ in the JSON but is mapped to ‘no-split-merge’ later in the code.
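For anyone else chasing this, the fragment of ./ClusterODM/config-default.json I changed looks like this (matching the documented default; verify the key name in your own checkout, since it gets mapped to no-split-merge internally):

```json
{
  "splitmerge": false
}
```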

Now tracking down node creation. The smallest machine that can satisfy the image load is chosen based on the ‘imageSizeMapping’ configuration in (in my case) the aws.json config file. You can set maxImages to anything and watch the selection move among the machine types, or fail if none is big enough. This is with split and overlap set. So, there’s a problem. I’m guessing it’s perhaps in the task table, which isn’t generating tasks according to the number of submodels? Or I’m just missing a flag someplace that says “create submodels for processing”. If anyone knows, please let me know.
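To illustrate the selection behavior described above, here is a toy sketch (my own code, not ClusterODM’s actual implementation, and the mapping entries are invented): pick the smallest entry in imageSizeMapping whose maxImages covers the job, or fail if none does.

```javascript
// Toy model of the machine-type selection described above.
// Entries mimic an aws.json "imageSizeMapping"; field names and
// instance types are illustrative, not guaranteed to match your
// ClusterODM version.
const imageSizeMapping = [
  { maxImages: 40,  slug: "t3a.medium" },
  { maxImages: 80,  slug: "t3a.large" },
  { maxImages: 250, slug: "m5.xlarge" },
  { maxImages: 600, slug: "m5.2xlarge" }
];

function pickMachine(mapping, imageCount) {
  // Smallest type whose maxImages satisfies the load.
  const sorted = [...mapping].sort((a, b) => a.maxImages - b.maxImages);
  const entry = sorted.find(e => imageCount <= e.maxImages);
  if (!entry) throw new Error("No machine type can handle " + imageCount + " images");
  return entry.slug;
}

// A 540-image job selects the largest type here.
console.log(pickMachine(imageSizeMapping, 540)); // m5.2xlarge
```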

Hello,
I have some questions:

How powerful should main_nodeodm, the node which will combine all the sub-models, be?
In the AWS autoscale setup, only one EC2 instance will be spun up (with power depending on maxImages), right?
If I need nodes that are individually less powerful but greater in number, is that achievable?

Yes, the node architecture is: master node (based on total images) <–> worker nodes (based on split).

The size of both master and workers is determined by the number of images each has to process. That is controlled by the image size mapping portion of your ASR configuration .json file. So, if you want finer splits, provide a lower split number and ODM will generally use smaller worker nodes, and more of them, in the cluster.
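For example, steering the mapping toward smaller instances in the ASR .json might look like this (field names are from my own config and may differ by provider or ClusterODM version; treat this as a sketch, not a drop-in):

```json
{
  "imageSizeMapping": [
    { "maxImages": 50,  "slug": "t3a.large" },
    { "maxImages": 120, "slug": "m5.xlarge" }
  ]
}
```

With split = 100 on a 540-image set, each submodel (roughly 100–120 images once overlap is counted) would then land on an m5.xlarge rather than one machine sized for all 540.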

ODM takes the split you provide as a suggestion: it will not produce a mutually exclusive segmentation of the images, because results need to be stitched back together, so the submodels overlap. If you say split = 100 on a 300-image set, you might see nodes with, e.g., 120 images rather than exactly 100.
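Here’s a toy 1-D illustration of why submodels overshoot the split (my own sketch; real ODM clusters images spatially, but the arithmetic is similar): each cluster keeps its own images plus any images within the overlap distance of its boundaries.

```javascript
// Toy 1-D version of split + overlap. Images sit at integer
// positions 0..totalImages-1; partition into clusters of `split`
// positions, then extend each cluster by `overlap` on both sides.
function submodelSizes(totalImages, split, overlap) {
  const sizes = [];
  for (let start = 0; start < totalImages; start += split) {
    const lo = Math.max(0, start - overlap);
    const hi = Math.min(totalImages, start + split + overlap);
    sizes.push(hi - lo);
  }
  return sizes;
}

// split=100, overlap=10 on 300 images: edge clusters grow to 110,
// the middle one to 120 -- "120 rather than exactly 100".
console.log(submodelSizes(300, 100, 10)); // returns [110, 120, 110]
```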
