Autoscaling using ClusterODM + AWS

Restarting a thread here. (What's the policy, by the way: revive old closed threads or start similar new ones?)

On an AWS Linux 2 instance, I'm running WebODM, a single locked node, and ClusterODM. After a few minutes of trying to send the task to the processing node, I get this error and no task output.

Connection error: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

I have an AWS config file (sanitized) as follows:

Configuration File

{
    "provider": "aws",

    "accessKey": "<access>",
    "secretKey": "<secret>",
    "s3": {
        "endpoint": "s3.us-east-1.amazonaws.com",
        "bucket": "<my bucket>"
    },
    "securityGroup": "<my security group>",

    "monitoring": false,
    "maxRuntime": -1,
    "maxUploadTime": -1,
    "region": "us-east-1",
    "tags": ["type,clusterodm"],

    "ami": "<my ami>",
    "engineInstallUrl": "\"https://releases.rancher.com/install-docker/19.03.9.sh\"",

    "spot": true,
    "imageSizeMapping": [
        {"maxImages": 40,   "slug": "t3.small",   "spotPrice": 0.02, "storage": 60},
        {"maxImages": 80,   "slug": "t3.medium",  "spotPrice": 0.04, "storage": 100},
        {"maxImages": 250,  "slug": "m5.large",   "spotPrice": 0.09, "storage": 160},
        {"maxImages": 500,  "slug": "m5.xlarge",  "spotPrice": 0.19, "storage": 320},
        {"maxImages": 1500, "slug": "m5.2xlarge", "spotPrice": 0.38, "storage": 640},
        {"maxImages": 2500, "slug": "r5.2xlarge", "spotPrice": 0.5,  "storage": 1200},
        {"maxImages": 3500, "slug": "r5.4xlarge", "spotPrice": 1.0,  "storage": 2000},
        {"maxImages": 5000, "slug": "r5.8xlarge", "spotPrice": 2.0,  "storage": 2500}
    ],

    "addSwap": 1,
    "dockerImage": "opendronemap/nodeodm"
}

My bucket is the one I've been manually moving results to from the primary.
My AWS security group is set up with SSH open to the primary machine and TCP open on ports 3000-3001.
My AMI is a snapshot of the primary machine.

I think the likelihood is high that I've got some IAM work to do to allow ODM or ec2-user to launch instances. If someone has this working in AWS and would share their AWS-side config, that might help.
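For reference, here's the rough shape of the IAM policy I've been assuming the autoscaler needs - treat the actions and the resource scoping as my guesses, not a verified minimal set:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["ec2:*"],
                "Resource": "*"
            },
            {
                "Effect": "Allow",
                "Action": ["s3:*"],
                "Resource": [
                    "arn:aws:s3:::<my bucket>",
                    "arn:aws:s3:::<my bucket>/*"
                ]
            }
        ]
    }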

Thanks!


Update - after revisiting my ec2-user credentials, the error message has changed to:

HTTPConnectionPool(host='10.0.0.181', port=3000): Read timed out. (read timeout=30)

10.0.0.181 is the private IP of my primary.
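As a basic reachability check, curling the proxy's info endpoint from the primary might tell something (assuming ClusterODM forwards NodeODM's /info):

    curl http://10.0.0.181:3000/info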


Going far afield here, but what does your firewall situation look like, and do you know if anything else is holding onto port 3000 on that machine?

Nothing else on 3000. Might be a big ask, but if someone knows the IAM config required on the AWS side, that might help. I've got an ec2-user defined with privileges on S3, EC2, and autoscaling. This is a credentialed user (no console access). Do I need the PEM file somewhere under ~/?


Also, just to answer the firewall (security group) question: I have SSH and ports 3000-3001 open to other nodes in the subnet that might appear.
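For completeness, the group's rules can be dumped with the AWS CLI (group ID is a placeholder):

    aws ec2 describe-security-groups --group-ids <my security group> --region us-east-1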


A couple of things overnight -

  1. I assume the ClusterODM process needs to be cycled whenever the aws.json spec file changes, correct? Is there anything a restart might not clear out, for instance WebODM or the node?
  2. Is there a node debug flag that can be passed on startup to provide more insight?

  1. Sounds reasonable, but I cannot verify.

  2. This may work:
    javascript - Node command line verbose output - Stack Overflow
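ClusterODM itself also takes a log-level flag at startup, e.g.:

    node index.js --asr aws.json --log-level verbose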

Has anyone used the Docker invocation of ClusterODM to launch autoscaling? If so, what's the signature for passing in the --asr config.json? I've tried it and it does not attempt to autoscale. I've also tried running ClusterODM autoscaling with "node index.js --asr aws.json", with consistent connection errors.
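For reference, here's the shape of the Docker invocation I've been attempting - the in-container path for the mounted config is my guess:

    docker run --rm -ti -p 3000:3000 -p 8080:8080 -p 10000:10000 \
        -v $(pwd)/aws.json:/var/www/aws.json \
        opendronemap/clusterodm --asr /var/www/aws.json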

In breaking this down, I've tried manually launching the AMIs indicated in the asr config file and consistently get an "offline" status when adding them to the node cluster. I don't know if this is a hint about what's gone wrong. I'm about to try one of the Ubuntu AMIs instead of my own.
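For context, I'm adding them through the ClusterODM admin CLI on port 8080, roughly like this (the IP is a placeholder):

    telnet localhost 8080
    NODE ADD <node-private-ip> 3000
    NODE LIST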


I'm getting an error message when firing up "node index.js --asr aws.json --log-level verbose": error: Cannot initialize ASR: Docker-machine not found in PATH. Please install docker-machine if you want to use the autoscaler.

I'm on AWS Linux 2 - I didn't think Linux needed docker-machine.

Here’s the full trace.

node index.js --asr aws.json --log-level verbose
info: ClusterODM 1.5.3 started with PID 19246
info: Starting admin CLI on 8080
warn: No admin CLI password specified, make sure port 8080 is secured
info: Starting admin web interface on 10000
warn: No admin password specified, make sure port 10000 is secured
info: Cloud: LocalCloudProvider
info: ASR: AWSAsrProvider
info: Can write to S3
error: Cannot initialize ASR: Docker-machine not found in PATH. Please install docker-machine if you want to use the autoscaler.


Are you able to add docker-machine to your AWS Linux 2 instance?

It doesn't look like docker-machine has anything to do with which OS is hosting; it's more of a toolkit for managing Docker instances locally and in the cloud.
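If it helps, the Linux release binary can be installed along these lines (v0.16.2 is an assumed release; check the docker/machine releases page for the current one):

    base=https://github.com/docker/machine/releases/download/v0.16.2
    curl -L $base/docker-machine-$(uname -s)-$(uname -m) -o /tmp/docker-machine
    sudo install /tmp/docker-machine /usr/local/bin/docker-machine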

Yes - loaded docker-machine and restarted ClusterODM from the node command -

node index.js --asr aws.json --log-level verbose
info: ClusterODM 1.5.3 started with PID 20636
info: Starting admin CLI on 8080
warn: No admin CLI password specified, make sure port 8080 is secured
info: Starting admin web interface on 10000
warn: No admin password specified, make sure port 10000 is secured
info: Cloud: LocalCloudProvider
info: ASR: AWSAsrProvider
info: Can write to S3
info: Found docker-machine executable
info: Loaded 0 nodes
info: Loaded 0 routes
info: Starting http proxy on 3000


That looks okay so far. No errors, at least.

Just needs some nodes, it looks like.

Progress!

Can’t seem to get past this -

Connection error: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

I've tried manually starting a node from the same AMI specified in the asr config; it has Docker loaded but nothing else, and I'm able to spin up and use that node instance. One thing I have noticed, though: instances run from the ClusterODM Docker image work as expected, but ClusterODM started with "node index.js" does not - it typically fails even when given only a single node on the local machine. Consequently, I'm wondering if my ClusterODM install is broken. It took quite a while to get it installed because of the node-libcurl issues.
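If a broken install is the culprit, a clean checkout per the README might be worth a try (assuming Node.js and git are already in place):

    git clone https://github.com/OpenDroneMap/ClusterODM
    cd ClusterODM
    npm install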


Is the "engineInstallUrl" attribute, and the subsequent load of Docker, likely to collide with what is already installed? Or is it not viable on AWS Linux 2?

The URL is https://releases.rancher.com/install-docker/19.03.9.sh

I think it might, especially if something is already running and holding the ports that the Docker instance will want to use.

I created three different AMIs from vanilla Linux installs: two AWS Linux 2 variants and one RHEL variant. I then referenced their AMI IDs in the configuration file and tried each one. ClusterODM is being launched with the "node index.js --asr ..." signature. I've got a single locked node on the primary machine linked to the local cluster, and no other nodes. Additionally, for a 319-image set, I have 175 as the split level and the default overlap. The machine itself is a t2.2xlarge with 32 GB RAM and a 300 GB drive.

Still getting - Connection error: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

I do have the monitoring flag switched on, although nothing noteworthy shows in the AWS logs. I'm guessing at this point that my security permissions for ec2-user must be lacking. Perhaps I'm missing some crucial component for autoscaling or spot requests on the AWS side?
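If it helps narrow things down, this should show whether ClusterODM's spot requests are reaching AWS at all (assuming the AWS CLI is configured with the same credentials):

    aws ec2 describe-spot-instance-requests --region us-east-1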


I'm unable to run the Docker install script manually on my instances! Progress! Is there any way I can install Docker on the instance ahead of time and not perform the install as part of the instance launch?


Why won’t the script run? Is it erroring out on a package name? Permission?

Is your user in the docker group?

Several different errors depending on the Linux version - this is from manual execution of the script after logging into the machine as ec2-user. I launched separate instances from spot requests of the AMIs tailored to each OS, then PuTTY'd into them, manually curled the script, chmod'd it, and executed it.
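For reference, the manual run on each box looked roughly like this:

    curl -fsSL https://releases.rancher.com/install-docker/19.03.9.sh -o install-docker.sh
    chmod +x install-docker.sh
    sudo ./install-docker.sh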

  1. With AWS Linux and Docker installed, it tells you Docker is installed and to quit before you hurt yourself (it would be nice if it accepted the existing Docker installation).
  2. With AWS Linux and no Docker, it simply says it doesn't recognize the OS.
  3. With RHEL, it got a little farther than #1 but then failed without recognizing the version of RHEL.
  4. With Ubuntu, I couldn't even get logged into the instance as ec2-user to try a manual execution (not sure why an Ubuntu instance launched from AWS isn't recognizing ec2-user).
  5. Trying a US East Ubuntu AMI separately also failed to initiate autoscaling, with the "connection failed" message mentioned above.

In the instructions for launching AWS instances here (ClusterODM/aws.md at master · OpenDroneMap/ClusterODM on github.com, thanks @mateo3d), I was wondering exactly what this meant:

  • Create an IAM account for ClusterODM to use, which has EC2 and S3 permissions.

The only IAM account I have to interact with is ec2-user, which has the necessary privileges and is used to fire up all the ODM processes. I assume script execution on the primary machine will act as ec2-user when attempting to autoscale. But it's not actually getting that far.
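A quick way to confirm which identity the credentials on the primary actually resolve to (assuming the AWS CLI is installed and using the same keys):

    aws sts get-caller-identity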

And referenced here in the script:

  If you would like to use Docker as a non-root user, you should now consider
  adding your user to the "docker" group with something like:

    sudo usermod -aG docker $your_user

  Remember that you will have to log out and back in for this to take effect!

  WARNING: Adding a user to the "docker" group will grant the ability to run
           containers which can be used to obtain root privileges on the
           docker host.
           Refer to https://docs.docker.com/engine/security/security/#docker-daemon-attack-surface
           for more information.