WebODM + ClusterODM + AWS scaling == Read timed out

So I’ve been trying out WebODM with ClusterODM and AWS auto-scaling. It spins up the machine fine, gives it a job, and appears to begin processing it; however, in WebODM I see:

Connection error: HTTPConnectionPool(host='<ip>', port=3000): Read timed out. (read timeout=30)

In 3 tests, the machine failed to spin up once; the other 2 times it spun up and appeared to get the job fine.

Is this related to this issue? Or something else maybe?

This was happening with DigitalOcean droplets as well. On DO, getting it to work required forcing Docker version 19.03.9. I’ve got the update for DO finished and will test the same change with AWS. Frankly, spot pricing on AWS beats DO by 50%; I just prefer the DO ecosystem.

Thank you, I really appreciate that. Looking forward to the results!

Well, this is embarrassing. When I was testing with AWS I realized the work I did to fix this was totally unnecessary. Just add the following option to your asr configuration file:

"engineInstallUrl": "\"https://releases.rancher.com/install-docker/19.03.9.sh\""

(Note the backslash-escaped inner quotes around the URL - that escaping is required.)

I did a pull request to update the instructions on config file creation. This is tested and working on AWS and DigitalOcean. Anyone care to try Hetzner and Scaleway?
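
For reference, here is a stripped-down sketch of where that option sits in an AWS asr file. The point is only engineInstallUrl; every other field and value below is just a placeholder from memory, so check the sample configuration in the ClusterODM repo for the exact fields your version expects:

{
    "provider": "aws",
    "accessKey": "CHANGEME",
    "secretKey": "CHANGEME",
    "region": "us-west-2",
    "securityGroup": "CHANGEME",
    "engineInstallUrl": "\"https://releases.rancher.com/install-docker/19.03.9.sh\"",
    "imageSizeMapping": [
        { "maxImages": 50, "slug": "t3a.medium", "storage": 60 }
    ]
}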

What is engineInstallUrl? What does it do?

I can’t seem to find success with that setting. I tried it with and without the double quotes (including the \ on the sets). I just get an error that says something like docker-machine exited with error status 1.

For clarity these are the steps I’m doing:

  1. Upload images
  2. After the images upload, it goes to “uploading images to processing node”
  3. Shortly after it says Connection error: HTTPConnectionPool(host='<ip>', port=3000): Read timed out. (read timeout=30)
  4. Meanwhile, several minutes later, I see this in the ClusterODM output:
info: Trying to create machine... (1)
warn: Cannot update info for <ip>:3000: connect ECONNREFUSED <ip>:3000
info: Waiting for <ip>:3000 to get online... (1)
  5. After a couple of minutes, I can see the node indeed spins up and gets a job:
#> NODE LIST
1) 127.0.0.1:3001 [online] [0/1] <engine: odm 2.4.3> <API: 2.1.4> [L]
2) <ip>:3000 [online] [1/1] <engine: odm 2.4.7> <API: 2.1.4> [A]

That’s why I linked that issue above. I feel like if it just waited longer, it’d be fine.

I’d recommend setting up a custom AMI that has your SSH keys built in so you can log into the autoscaled node and check its Docker logs in real time. Have you tried keeping all of the WebODM settings exactly the same but changing to DigitalOcean as the provider? You don’t have to stick with them long term - just test to see whether it’s a platform-agnostic issue or something specific to your config and AWS.
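
To make that concrete, something along these lines (assuming an Ubuntu-based AMI; the key path, user, and container ID are placeholders for whatever your setup uses):

ssh -i ~/.ssh/autoscale-key.pem ubuntu@<node-ip>   # log into the autoscaled node; the user depends on your AMI
docker ps                                          # find the NodeODM container
docker logs -f <container-id>                      # follow its logs in real time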

Does your setup work without autoscaling on AWS? If you spin up an instance manually on EC2 and add it manually via ClusterODM’s telnet interface, do jobs process successfully?
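
Roughly like this from the ClusterODM telnet console (8080 is the default admin port unless you’ve changed it; swap in your instance’s IP):

telnet localhost 8080
#> NODE ADD <ec2-instance-ip> 3000
#> NODE LIST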

Just in case you haven’t already, turn off spot instances for testing. Fewer variables to worry about.

My main worry is that it doesn’t even begin to spin up docker-machine until after the timeout.

It does work fine without autoscaling; we tested that first, then moved on to autoscale tests.

I’ll try SSHing into the other machine, but it really does seem fine, and I get the error well before the AWS machine has even spun up. Testing without spot pricing might be worth a shot; it could be causing a delay (though I can live with the delay, it’d be nice to be able to use spot pricing, haha).

I might be able to try DO, but I’m not set up on them at all. We have an AWS grant, so it’s really worthwhile getting it to work for us.

Spot pricing is hit or miss with autoscaling if you try using more powerful instances. What types are you trying - custom or the defaults in the config? You know how, when you do spot requests directly from EC2, you get to choose a pool and it tells you your success probability? Not so with autoscaling. It’s a single type in a single region, so there may be a delay, or instances may not get created at all, on the first few requests (sounds like you possibly experienced that already). Play around with your EC2 types and pool limiting in the spot request tool to see how wildly the spin-up times vary. It may help to go to a different region or process in off hours, but at that point I’d rather just pay full price.
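
If it helps, you can also get a rough feel for how contested a pool is from the CLI by pulling spot price history for a type (the instance type, date, and region below are just examples):

aws ec2 describe-spot-price-history \
    --instance-types m5.2xlarge \
    --product-descriptions "Linux/UNIX" \
    --start-time 2021-06-01T00:00:00 \
    --region us-east-1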

Since AWS already works for you without autoscaling, my bet is the spot instance requests are causing too much delay for ODM. It would be super cool if we could do spot pools. Excited to hear your results!

Turns out we absolutely can do spot pools.

@Discordian - Any chance you could help test this pretty please? The spot fleet configs can get nutty!

OK, so after a bunch more testing I don’t think the original problem was spot pool delay related after all. This extra commit fixes it for me though:
https://github.com/OpenDroneMap/ClusterODM/commit/f9f71d77e28d59030168301e3a9442d609d39de2

Thanks for these! I’ll first check without spot pricing, and keep this feature in mind for when we use spot pricing!

Edit: Without spot pricing, there’s still a massive delay. I’m going to modify ClusterODM to dump way more output to figure out exactly why. It’s odd you don’t have the same experience as I do. Maybe it’s because my data directory is on EFS?

I got a little excited with the spot pool stuff and closed that pull request. Check out the new one with just a quick fix for AWS autoscale. The only changes that matter are in aws.js:

https://github.com/OpenDroneMap/ClusterODM/pull/71/files#diff-f7ff5549b357fab1842355df87aae5e23e8eddba23709e89a02a661cd9ac063e

Wow, is it really as simple as removing the “, false” entries? I’ll take that for a spin now.

I have the exact same result. I’ll drop more debugging in there; if I don’t report back today, I’ll report back tomorrow. I want to find exactly where the bottleneck is.

It would be nice, though, if it handled long delays like this better; maybe I’ll look into Connection error: HTTPConnectionPool(...): Read timed out. · Issue #775 · OpenDroneMap/WebODM · GitHub and see if I can tackle it.

Edit: Sorry I forgot today is Friday. I actually thought it was Thursday all day. I’ll likely report back in on Monday!!
