ClusterODM on AWS - slowness

I’ve been running a few maps on AWS, and it’s been working just awesome. So far I’ve made a few maps from 200ish photos, and they’ve all taken a couple of hours to finish.
But my latest map, made from 995 images, never seems to finish. 24 hours so far, and nothing much happening.

I’m running it with ClusterODM, using autoscaling to set up the processing nodes, so it’s running on a t3a.2xlarge machine (8 vCPU / 32 GB RAM). But the CPU has been more or less idling for the last few hours.

[Screenshot: CPU usage on the processing node]

And the logs from the processing shows nothing much going on:

2020-03-31 01:59:29,526 INFO: DJI_0302.JPG.modified.jpeg resection inliers: 5651 / 5651
2020-03-31 01:59:29,704 DEBUG: Ceres Solver Report: Iterations: 5, Initial cost: 8.765221e+02, Final cost: 3.457865e+02, Termination: CONVERGENCE
2020-03-31 01:59:29,708 INFO: Adding DJI_0302.JPG.modified.jpeg to the reconstruction
2020-03-31 01:59:30,157 INFO: Re-triangulating
2020-03-31 01:59:30,236 INFO: Shots and/or GCPs are well-conditionned. Using naive 3D-3D alignment.
2020-03-31 02:16:55,735 INFO: Shots and/or GCPs are well-conditionned. Using naive 3D-3D alignment.
2020-03-31 04:27:52,562 DEBUG: Ceres Solver Report: Iterations: 20, Initial cost: 7.536832e+04, Final cost: 4.117475e+04, Termination: CONVERGENCE

So the latest update in the log was 3.5 hours ago.

Is this normal behaviour?

By comparing the logs from the old, working job with 200ish images against this one, I’ve concluded it’s not normal behaviour. In the working one, it took 1 minute from the last Ceres Solver Report before it moved on to undistorting. So I deleted the job, and will do a reflight.
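For anyone wanting to make the same comparison without eyeballing timestamps, here is a rough sketch that prints every long pause between consecutive log lines. It assumes the lines start with a `YYYY-MM-DD HH:MM:SS` timestamp like the excerpt above and that GNU `date` is available; `log_gaps` and the file name in the usage comment are made-up names, not part of ODM or ClusterODM.

```shell
# Print every pause longer than THRESHOLD seconds (default 600) between
# consecutive timestamped lines read from stdin.
# Assumption: GNU date (needs the -d option) and "YYYY-MM-DD HH:MM:SS" prefixes.
log_gaps() (
  threshold="${1:-600}"
  prev=""
  while IFS= read -r line; do
    # Skip lines that do not start with a timestamp (tracebacks, blank lines).
    case $line in
      [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]" "[0-9][0-9]:[0-9][0-9]:[0-9][0-9]*) ;;
      *) continue ;;
    esac
    ts=$(printf '%s' "$line" | cut -c1-19)      # drop the ",mmm" milliseconds
    now=$(date -u -d "$ts" +%s) || continue     # timestamp as epoch seconds
    if [ -n "$prev" ] && [ $((now - prev)) -gt "$threshold" ]; then
      echo "$((now - prev))s gap before: $line"
    fi
    prev=$now
  done
)

# Usage (hypothetical file name):
#   log_gaps 600 < task_output.txt
```

Run on the excerpt above with a 600 second threshold, this would flag the roughly 17 minute pause before the second alignment line and the pause of over two hours before the final Ceres report.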


Possibly low RAM for that sort of job size. I’ve got a couple of jobs running right now on different nodes in the 1500 range that are at about 150-200GB RAM utilised. Perhaps the ODM process was OOM’d and didn’t update the log.
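A quick way to test the OOM theory, as a sketch only: the kernel normally logs a line when its OOM killer ends a process, so filtering the kernel log for those messages on the node should show whether that happened. The exact wording varies between kernel versions, so the pattern below is a guess at the common variants, and `oom_events` is just a helper name made up for this example.

```shell
# Filter kernel log text (stdin) for lines the OOM killer typically writes.
# Assumption: the pattern covers common wordings; it is not exhaustive.
oom_events() {
  grep -i -E 'out of memory|oom-kill|killed process'
}

# Usage on the processing node (dmesg may need root):
#   dmesg -T | oom_events
#   free -h        # current RAM and swap headroom
```

No output from the filter does not prove the job was healthy, since the kernel ring buffer can wrap on a long-running machine, but a hit naming the ODM process would pretty much confirm it.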

Also I’ve found that in general AWS use the crappest hardware that they can get away with. I’ve seen VMs with 24 core 1.8GHz Intels in them like the E5-2448L. And of course, you’re paying by the minute, so what do they care that your jobs take twice as long because they use garbage CPUs? Bezos truly is a dirt ball when it comes to exploiting every morsel he can out of anything. I’d recommend using Digital Ocean or Vultr over AWS all day long. I use my own hardware, which works out cheapest of all, but seriously, AWS will rob you blind. It makes my blood boil how many AWS evangelists there are out there too - they’re generally either the people that don’t foot the bill, or haven’t done the calculations for TCO.

Run “cat /proc/cpuinfo” to see what you’re on. Here’s one from Vultr for comparison ($6/month node, flat rate):

# cat /proc/cpuinfo | grep MHz
cpu MHz : 3792.000
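If you want the model and thread count alongside the clock speed, something like this could do it in one go. It is a sketch that assumes the x86 layout of /proc/cpuinfo (`model name` and `cpu MHz` lines); on other architectures those fields may be missing. `cpu_summary` is a made-up helper name.

```shell
# Summarise /proc/cpuinfo-style text from stdin:
# first model name, number of "cpu MHz" entries (one per thread), highest MHz.
cpu_summary() {
  awk -F': *' '
    /^model name/ && model == "" { model = $2 }
    /^cpu MHz/ { n++; if ($2 + 0 > max) max = $2 + 0 }
    END { printf "%s | %d threads | max %.0f MHz\n", model, n, max }
  '
}

# Usage:
#   cpu_summary < /proc/cpuinfo
```

Keep in mind that `cpu MHz` reports the current frequency, so an idle machine can read lower than its rated speed.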

Yes, OOM was my thought as well. Does anybody know where I can find the SSH key that ClusterODM generates to spin up the processing nodes?

The t3a class uses AMD EPYC series CPUs at 2.5 GHz.

And yes, I do agree with you on AWS pricing in general. What makes running anything on AWS livable is using spot instances: an instant 70-90% cost reduction.


Couldn’t agree more.


If you want to SSH into a node that was auto-spun from ClusterODM, first run:

docker-machine ls

Take note of the instance name and then:

docker-machine ssh <machine name>

It will log you in.
