AWS autoscaling setup: useful resources and questions

Hi,

I’m trying to set up autoscaling in AWS, but so far no luck after reading and trying the links from previous forum posts. I would really appreciate it if anyone with more experience could share what worked for them. It’s a long post, but thanks for your patience!

The quick question is: do I have to use a web interface to trigger the job? I know that in the architecture diagram WebODM sits on top of ClusterODM, but from the posts I’ve read it looks like some people trigger a job from the command line alone, e.g. a previous post uses docker run -ti -v "$(pwd)/images:/code/images" opendronemap/odm --split 2500 --sm-cluster http://youriphere:3000.

If the answer is yes, you can skip the following part and jump to the summary of useful resources I found along the way. If not, here is what I’ve tried and the errors I came across.

My steps are

  1. set up an EC2 instance with 16 GB of RAM, just to be safe

  2. install the necessary packages, especially docker and docker-machine

  3. set up ClusterODM using:

    git clone https://github.com/OpenDroneMap/ClusterODM
    cd ClusterODM
    npm install
    

    *and saved an AWS autoscaling config JSON in the same ClusterODM folder

    *then ran node index.js --asr configuration.json

  4. set up NodeODM using the docker command

    docker run -p 3001:3000 opendronemap/nodeodm
    

    but changed the port from 3000 to 3001 so that I could use it as a dummy node
    *added and locked the node using

    telnet localhost 8080
    > NODE ADD localhost 3001
    > NODE LOCK 1
    > NODE LIST
    1) localhost:3001 [online] [0/2] <version 1.5.1> [L]
    
  5. downloaded a test dataset of 52 images that had previously been orthomosaicked easily with ODM, using the command docker run -ti -v /home/ubuntu/odm/917/code/images:/code/images opendronemap/odm --split 30 --split-overlap 3 --sm-cluster http://172.17.0.1:3000 --debug --verbose > ./output.txt (a few sanity checks I could have run at this point are sketched below)
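
For reference, here are a few quick sanity checks (just a sketch, using the ports from the steps above; ClusterODM speaks the same NodeODM API, so /info should answer on both ports) to confirm everything is wired together before launching the run:

    # the dummy NodeODM node should answer directly on 3001
    curl http://localhost:3001/info
    # ClusterODM should answer on 3000 via the reference node
    curl http://localhost:3000/info
    # the admin CLI on 8080 should list the dummy node as online and locked ([L])
    telnet localhost 8080
    > NODE LIST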

Here is where I start getting errors.

Previously, when ClusterODM started running, it connected to the dummy node with a few warnings but no errors.
I did get a hello.txt file in S3; I don’t know if this is related.

When NodeODM starts running, there are also no errors.

However, when I use the docker command to point processing at the cluster, it says “attempted to autoscale but failed”.

In the ClusterODM screen there are several error messages, and I think the main one is:

    Cannot create machine: Error: docker-machine exited with code 1

I checked my EC2 history: instances were created, but they got terminated right away.

Here might be a similar issue with DigitalOcean.

Could this be related to the install script referenced in the configuration file: "engineInstallUrl": "\"https://releases.rancher.com/install-docker/19.03.9.sh\""?

I’m not really familiar with docker-machine; it would be nice if someone could help me understand its purpose here and whether there’s a way to work around it.
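
For context, here is a rough sketch of the kind of call ClusterODM seems to delegate to docker-machine (based on the generic docker-machine amazonec2 driver options, not the exact command it builds); running something similar by hand on the ClusterODM box should surface the real AWS error hiding behind “exited with code 1”:

    # access keys can also come from the usual AWS environment variables
    docker-machine create --driver amazonec2 \
      --amazonec2-region us-east-1 \
      --amazonec2-instance-type t3a.medium \
      --amazonec2-root-size 300 \
      --amazonec2-vpc-id <your vpc id> \
      --amazonec2-subnet-id <your subnet id> \
      --amazonec2-security-group <your security group name> \
      --amazonec2-ami <your ubuntu ami> \
      --engine-install-url "https://releases.rancher.com/install-docker/19.03.9.sh" \
      odm-test-node
    # clean up the test machine afterwards:
    docker-machine rm -f odm-test-node

If this fails the same way, docker-machine usually prints the underlying AWS error (bad AMI, unreachable subnet, failed install script, etc.), which is much more informative than the exit code alone.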

Another related question is about the dummy node. Why would some images still get processed locally when processing is pointed at the cluster? As mentioned in the docs:

You should always have at least one static NodeODM node attached to ClusterODM, even if you plan to use the autoscaler for all processing. If you setup auto scaling, you can’t have zero nodes and rely 100% on the autoscaler. You need to attach a NodeODM node to act as the “reference node” otherwise ClusterODM will not know how to handle certain requests (for the forwarding the UI, for validating options prior to spinning up an instance, etc.). For this purpose, you should add a “dummy” NodeODM node and lock it:

This way all tasks will be automatically forwarded to the autoscaler.

My understanding is that everything should be sent to the autoscaler and processed there. However, in my experience the first subset got processed locally and only the rest was sent to the autoscaler.

Summary

Links I found useful:

  • great summary about primary and secondary machines: How I set up clusterodm
  • autoscaling on aws
  • aws autoscaling
  • autoscale using odm
  • autoscaling using clusterodm + aws
  • some odm blogs
  • nodeODM repo
  • clusterODM repo


First some thoughts on your questions:

  • You do not have to have WebODM – if you have just ClusterODM and the reference NodeODM you will get a web interface of sorts “for free” at http://youriphere:3000 (it’s the typical NodeODM interface if I understand correctly), which is what your docker run command uses.
  • I was the poster with the DigitalOcean issue and in the end I dodged issues with docker-machine by using the container versions of the ODM packages.
  • The only reason I think your tasks would actually be processed locally is if the node was not locked (a quick check is sketched below).
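
A quick way to check is the [L] flag in NODE LIST from the admin CLI (same commands as in your post):

    telnet localhost 8080
    > NODE LIST
    1) localhost:3001 [online] [0/2] <version 1.5.1> [L]

If the [L] is missing, NODE LOCK 1 should keep tasks from being forwarded to that node.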

Now about my current approach:

It looks something like this:

  • a locked NodeODM instance
  • a ClusterODM instance hardcoded to use the former as its reference node and with a DigitalOcean-specific ASR config
  • a WebODM instance hardcoded to use the ClusterODM instance as its only active node

I actually had all of this running but broke it making “one last change” before committing things. :frowning: I do have the first two parts working as described, though. To hardcode the instance configurations, I manually configured a setup and copied aside the files to be added to the user-data I use to bring up the whole shebang on a small droplet – 1 vCPU with 2G RAM, if I recall correctly.
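
To give a rough idea (paths and the mount point below are placeholders, not my actual files), the first two pieces boil down to a couple of docker run calls plus the node registration:

    # locked reference NodeODM on port 3001
    docker run -d -p 3001:3000 opendronemap/nodeodm
    # ClusterODM on 3000 (admin CLI on 8080), pointed at a DigitalOcean ASR config
    docker run -d -p 3000:3000 -p 8080:8080 \
      -v /opt/asr/digitalocean.json:/var/www/digitalocean.json \
      opendronemap/clusterodm --asr /var/www/digitalocean.json
    # then register and lock the reference node via the admin CLI:
    #   telnet localhost 8080  ->  NODE ADD <host private IP> 3001, NODE LOCK 1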

The biggest challenge I have right now (once I fix what I broke, heh) is right-sizing the ASR image size mapping. There’s a lot of nuance in which switches have what kind of impact on the multidimensional space (RAM, CPU, disk) required to generate the output, but that merits a separate thread.


Hello - I can’t claim to have debugged the problem you’re having, but here’s how I run it, following @smather’s “baby steps” response in one of the posts to start with:

  1. Install WebODM on your primary machine. As both machines are the same size, choose your favorite.
    You will run WebODM without a node. This gives you a little more flexibility:
    ./webodm.sh down && ./webodm.sh update && ./webodm.sh restart --default-nodes 0

  2. Then run a Node separately on each instance. You’ll set your max-concurrency to the number of cores in your machine.
    a. For max images, on the primary machine, you will set this value arbitrarily high:
    docker run -p 3001:3000 opendronemap/nodeodm --max_concurrency 8 --max_images 1000000&
    b. On the secondary node, for 32GB of RAM (assuming you have 32GB of swap), you can set this value as high as 1500 images:
    docker run -p 3001:3000 opendronemap/nodeodm --max_concurrency 8 --max_images 1500

  3. Now you deploy your ClusterODM node on the primary instance and register your Nodes as described in the ClusterODM README:
    GitHub - OpenDroneMap/ClusterODM: A NodeODM API compatible autoscalable load balancer and task tracker for easy horizontal scaling. Make sure you are using a different port from the one used by NodeODM (a rough sketch of the registration commands follows this list).
    Finally, you need to add your ClusterODM node to WebODM, under Processing Nodes → Add New.
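
For reference, node registration through the ClusterODM admin CLI looks roughly like this (IPs are placeholders; these are the same commands shown in the telnet example earlier in the thread):

    telnet localhost 8080
    > NODE ADD <primary private IP> 3001
    > NODE ADD <secondary private IP> 3001
    > NODE LIST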

To start an autoscaling cluster, remember to provide --config pointing at your master config:

    node index.js --config …/<your master config name>.json

Several things to note:

  1. I run things on a machine w/ 32G or higher. I run my base machine on an AWS Linux 2 box.
  2. My config points to an AMI that is Ubuntu. This is because 19.03.9.sh doesn’t handle AWS Linux 2. This means that the machine you spawn with docker-machine will be Ubuntu boxes.
  3. I do use WebODM but agree with the other poster that node is really all you need. But I like using the WebODM interface which includes the Potree viewer.
  4. There are some pretty critical and specific settings in the config file that are good to know. I’m showing 2 config files here, the master config (passed to the index call) and the cluster config. The master config points to the cluster config file.

Master config:

    {
        "port": 3000,
        "secure-port": 0,
        "admin-cli-port": 8080,
        "admin-web-port": 10000,
        "admin-pass": "",
        "cloud-provider": "local",
        "downloads-from-s3": "",
        "splitmerge": true,
        "cluster-address": "<private IP here>",  I use the private IP of my primary box
        "token": "",
        "debug": false,
        "log-level": "debug",
        "upload-max-speed": 0,
        "flood-limit": 0,
        "stale-uploads-timeout": 0,
        "ssl-key": "",
        "ssl-cert": "",
        "asr": "…/awsv2.json"
    }

Cluster Config File:

    {
        "provider": "aws",

        "accessKey": "<your aws access key>",
        "secretKey": "<your aws secret key>",
        "s3": {
            "endpoint": "s3.amazonaws.com",
            "bucket": "<your bucket name>"
        },
        "vpc": "<your vpc id>",
        "subnet": "<your subnet id>",
        "securityGroup": "<the name of your security group>",  I've failed by using the security group id
        "createRetries": 3,
        "monitoring": false,
        "maxRuntime": -1,
        "maxUploadTime": -1,
        "region": "<e.g. us-east-1>",
        "zone": "<e.g. c>",  mine failed without this being passed to the docker-machine directive, and this zone needs to be able to create the machine types listed in your image mapping below
        "tags": ["type,clusterodm"],

        "ami": "<your machine image for the workers>",  19.03.9.sh does not handle AWS Linux 2
        "engineInstallUrl": "\"https://releases.rancher.com/install-docker/19.03.9.sh\"",

        "spot": false,  I've had very little luck actually using spot instances since they can be reclaimed
        "imageSizeMapping": [
            {"maxImages": 40, "slug": "t3a.small", "spotPrice": 0.0188, "storage": 300},
            {"maxImages": 80, "slug": "t3a.medium", "spotPrice": 0.0376, "storage": 300},
            {"maxImages": 200, "slug": "m5.large", "spotPrice": 0.096, "storage": 300},
            {"maxImages": 500, "slug": "m5.xlarge", "spotPrice": 0.192, "storage": 320},
            {"maxImages": 1000, "slug": "m5.2xlarge", "spotPrice": 0.384, "storage": 640},
            {"maxImages": 2000, "slug": "r5.2xlarge", "spotPrice": 0.504, "storage": 1200},
            {"maxImages": 3000, "slug": "r5.4xlarge", "spotPrice": 1.008, "storage": 2000},
            {"maxImages": 4000, "slug": "r5.8xlarge", "spotPrice": 2.016, "storage": 2500}
        ],

        "addSwap": 1,
        "dockerImage": "opendronemap/nodeodm",
        "iamrole": "<your ec2 and s3 enabling role>"  needed if you want to write output to cloud storage
    }
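
For completeness: the master config is the file you pass on the command line, and its "asr" field points at this cluster config, so the launch only needs the one path (shown here as a placeholder):

    node index.js --config <path to your master config>.json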

Other things maybe worth mentioning:

  1. You’re dealing with a pretty small package of images, and the docs suggest nothing smaller than 36. That said, I’ve run smaller packages.
  2. The docker-machine command string can be viewed with the ASR command in the telnet admin interface on port 8080 (a sketch follows this list). This is your friend in getting your machines launched.
  3. When you are running, expect 2 rounds of scale out/scale in for your work package. Machines appear and disappear at a frightful rate.
  4. I’ve actually gotten very good results taking the scale up rather than scale out approach particularly with the GPU container @pierotofy has so graciously provided!
  5. There is an AWS team using WebODM now - don’t know them, but they’re using the standalone and have been looking at the AWS cluster as an alternative.
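
For example, something along these lines (from memory, so double-check with HELP, which lists the commands your version actually supports):

    telnet localhost 8080
    > HELP
    > ASR VIEWCMD 100

ASR VIEWCMD takes a number of images and prints the docker-machine create command ClusterODM would run for a task of that size, which you can then try by hand when debugging launch failures.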

Anyway - hope all this helps!


One other thing worth mentioning - for those using autoscaling, it’s worth a deep dive into the tasking process ODM uses to decide how jobs get created. It’s worth knowing about tasks, which you can also inspect via the telnet admin interface. A related document on submodeling is in the OpenSfM docs: Splitting a large dataset into smaller submodels — OpenSfM 0.4.0 documentation
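
To tie that back to the commands earlier in the thread, submodel (and therefore task) creation is driven by the split flags; the values here are only illustrative:

    docker run -ti -v "$(pwd)/images:/code/images" opendronemap/odm \
      --split 400 --split-overlap 100 \
      --sm-cluster http://<clusterodm ip>:3000

Once a run is going, the telnet admin interface on ClusterODM (port 8080) shows how the resulting tasks get routed to nodes.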

