Huge Resource Usage on latest NodeODM/WebODM

I am starting to think that the latest changes to ODM (which appear to parallelise the meshing, though I haven’t dived into the code yet) are sucking the guts out of my processing resources. I have been trying to process 406 12MP files (a small set, since I previously had 4000 20MP images running on a mod’ed ODM forked about 2 months ago). I created a new VM for a “quick” processing job with the commercial WebODM script and I am now days in, with lots of failures and no output. I currently have two parallel tasks running (on two separate servers and configs), neither of which has completed yet, but both have been giving alarming statistics:

Task 1 - Ortho only, no resize. Running on native WebODM (installed last week from the commercial script), Ubuntu 16.04 VM, 128GB RAM, 8 (out of 24) threads. It has currently been running for 27:20:00 hrs. PDAL maxed out at 4G/thread (no problems - I am sure I recorded that right), translate at 72G (1 thread), fillnodata currently 71G (57% complete after around 17 hrs actual - 1050 min CPU). HOWEVER, the …/meshing/tmp directory is at 558GB! This is about the 10th attempt at this one, as the VM kept maxing out the HDD. I currently have 192/820G of HDD remaining (hopefully enough). [Edit] Still 71G @ 62% (about 22 hrs/1300 CPU min into fillnodata - 32 hrs overall).

Task 2 (Attempt 1) - Ortho only, default resize (to 2048 on the node). Running NodeODM (Docker on Ubuntu 18.04 Server), 128GB RAM (Docker allocation), 256G swap (no Docker limit), 12 concurrent threads. Every PDAL process bar one was terminated by the OOM killer. The last PDAL task was at 224G (virtual), 125G (actual resident) RAM. Maxed out (before the OOMs) at 128GB RAM (as set), 140GB swap used. Each PDAL process (x12) requested at least 115G (virtual) and they were dropping off OOM at between 15-45G actual (as the 128G RAM limit was reached)! Total swap was never exhausted. I killed the task at that point (about 2:30:00 in).

Task 2 (Attempt 2) - As above but 6 concurrent (and Docker swap explicitly set to 250G). Obviously slower through OpenSfM. Each PDAL pipeline requested 352G (virtual) per thread and was dropped OOM as above (as the RAM limit was reached). Again, total swap was never exhausted. The last remaining pipeline stabilised at 125G (actual). This would suggest each pipeline’s data was around 32G (of double type). My 4000-image set never exceeded 18G previously. Again I gave up and terminated it. The back-of-envelope arithmetic below suggests the concurrency setting was never going to save this anyway.
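(Rough sanity check on concurrency, using only the figures above - the ~125G per-pipeline number is just what I observed on the last surviving pipeline, not a measured constant:)

```python
# Rough sanity check: how many PDAL pipelines of this size can coexist in RAM?
# Figures are the ones observed above (resident set of the last surviving
# pipeline ~125 GB, container RAM allocation 128 GB).

ram_limit_gb = 128          # Docker memory allocation
per_pipeline_gb = 125       # observed resident size of a single PDAL pipeline

for concurrency in (12, 6, 1):
    needed = concurrency * per_pipeline_gb
    fits = needed <= ram_limit_gb
    print(f"{concurrency:>2} concurrent -> ~{needed} GB needed, "
          f"{'fits' if fits else 'OOM expected'}")

# Even a single pipeline is right at the limit, so 12 or 6 concurrent
# pipelines were always going to trip the OOM killer.
```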

[Edit] Task 2 (Attempt 3) - As above but 1 concurrent. The PDAL pipeline started at 254G (virtual)/125G (resident) and failed soon after on memory. According to the logs (sorry, can’t figure out how to add a non-picture to the post), the DEM resolution is [3689417, 4567954] with PC bounds [minx -902286, maxx 202114, miny -236264, maxy 1131120]. This seems extremely excessive considering that the resolution for my 4000-image set point cloud (which was also at 2.5cm/px, not the default 5) was (quoting my test json file) "bounds": "([-1074.80603, 1054.061157], [-624.1271973, 786.5171509])", or around 85155 x 56426 px. It also covered approx 400ac versus the current 40ac set.
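Putting the two sets of bounds side by side drives the point home (a back-of-envelope sketch only; I’m assuming both bounds are expressed in the same local metre-like units):

```python
# Compare the filtered-point-cloud bounding boxes of the two datasets.

# Current 406-image set (from the PDAL/DEM log above)
cur_w = 202114 - (-902286)            # ~1,104,400 units wide
cur_h = 1131120 - (-236264)           # ~1,367,384 units high

# Previous 4000-image set (from my test json "bounds")
old_w = 1054.061157 - (-1074.80603)   # ~2,128.9 units wide
old_h = 786.5171509 - (-624.1271973)  # ~1,410.6 units high

print(f"current bbox : {cur_w:,.0f} x {cur_h:,.0f}")
print(f"previous bbox: {old_w:,.1f} x {old_h:,.1f}")
print(f"area ratio   : {cur_w * cur_h / (old_w * old_h):,.0f}x")

# The current bounding box is roughly 500,000x the area of the 4000-image
# set's, even though the actual site is ~10x smaller - so the DEM raster
# (and everything downstream) is being sized for points that shouldn't exist.
```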

@smathermather-cm or @pierotofy - any idea what might be going on? I gather the latest changes were meant to make it easier to run larger jobs on smaller machines, but I am now struggling to run a small job on (pretty) big machines. A job this size (ortho only) should be able to run in 32G of RAM or so.

I also have no idea why the NodeODM Docker image (downloaded on the weekend) seems to be markedly different in resource usage to native WebODM (installed last week). I was watching Task 1 pretty intently for the first 10 hrs (while watching Indianapolis 500 and Monaco F1 replays) and I didn’t see anywhere near the per-thread RAM requirements for PDAL in Task 1 (4G virtual - I wrote that down but am now wondering whether I read it right) that I did in Task 2 (up to 352G virtual). I didn’t note the actual memory used for Task 1, but it obviously didn’t get killed!!

Besides the RAM issues, for Task 1 each sub-mesh (tif) is about 1.1G on disk; I seem to have 195 sub-meshes for 409 photos (covering around 40ac @ 300’), merged.tmp.tif is 197G and merged.tif is 166G (@ 55%). Previously (with the 4000-image set) my largest file was 112G (I was checking all the tmp files while trying to debug the overflows in PDAL etc). At this rate even my SAN unit won’t have enough disk space to process the 4000-image set (I only have about 5TB total).

The only “data”-related issue may be that the photos are Orange/Cyan/NIR filtered (not RGB) for an NDVI analysis, though this should be transparent to ODM (the actual jpg files are readable as normal, objects are clear, just the “colours” are a bit funky from the filtering!). I assume from the fact that you have an NDVI script (from @dakotabenjamin) that I am not the first to try it!
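(For context, the end goal is the usual NDVI calculation, roughly like the sketch below. The band-to-channel mapping here is a placeholder - it depends on the particular RGN filter and is not necessarily what the ODM script uses:)

```python
import numpy as np
from PIL import Image

# Hypothetical band mapping for an Orange/Cyan/NIR (RGN-style) jpg: here I'm
# assuming red-ish reflectance lands in channel 0 and NIR in channel 2.
# The real mapping depends on the camera/filter and is NOT confirmed here.
img = np.asarray(Image.open("IMG_0001.jpg"), dtype=np.float64)
red, nir = img[..., 0], img[..., 2]

# Standard NDVI formula: (NIR - Red) / (NIR + Red), guarding against /0.
ndvi = (nir - red) / np.maximum(nir + red, 1e-6)
print(ndvi.min(), ndvi.max())
```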

Is it worthwhile trying to process on Lightning (I still have heaps of credit) to see what happens there?

I am yet to try it on my older mod’ed version of ODM, as that VM is on the same host as Task 1 (and I therefore don’t have the memory to run them concurrently).

Keen to get this running (and get some metrics) with NodeODM, as I am mulling over a third server to use with ClusterODM/NodeODM (mainly for the 4000-image set), but I don’t want to lay down the cash if it is not going to work for me - nor do I want to power three DL380 servers (plus SAN) each idling at 5% with one or two threads!


This is a bit strange; the big change has been the swap of SMVS with MVE (for dense reconstruction), which has lengthened runtime and perhaps increased memory usage a bit, but should still be within reasonable numbers.

You could always rebuild the docker images to use a previous version back in time.

It’s always worth a try to see if lightning can process them.

It could be the dataset (although others seem to have used RGN before, with unrelated i.e. EXIF issues), as I ran it (partially - got a memory alloc error, I assume due to size) on a previous version and the point cloud ply was “strange”. I used the OpenSfM tools to visualise the reconstruction.json and it also didn’t look as clean as I expected (though the points did show a meaningful landscape). Will play around a little more. If it is outlying (incorrect) points, then that accounts for the sizing issue.
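In case anyone wants to do the same check, this is roughly how I eyeball the point extents straight from the reconstruction.json (a sketch only; I’m assuming the usual OpenSfM layout where each reconstruction has a "points" dict with "coordinates" entries, and the usual ODM project path):

```python
import json

# Load OpenSfM's reconstruction.json and print the raw extent of the sparse
# points - wildly outlying points should stand out immediately.
with open("opensfm/reconstruction.json") as f:
    reconstructions = json.load(f)

for i, rec in enumerate(reconstructions):
    coords = [p["coordinates"] for p in rec["points"].values()]
    xs, ys, zs = zip(*coords)
    print(f"reconstruction {i}: {len(coords)} points")
    print(f"  x: {min(xs):.1f} .. {max(xs):.1f}")
    print(f"  y: {min(ys):.1f} .. {max(ys):.1f}")
    print(f"  z: {min(zs):.1f} .. {max(zs):.1f}")
```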

I have been running it again looking for potential causes. It seems that OpenSfM is generating points way outside the normal bounds. The filtered point cloud (used for the bounds) is about 3M x 3M, or 764 x 764 km, whereas the real site is about 500 x 500 m! No wonder it is killing the processing. Continuing to try to identify the issue, as zooming in on the ‘real’ data shows that there appear to be extraneous points in all dimensions. I have checked the GPS positions of the photos in another tool - they look OK, and a sample of photos would indicate the elevations are correct.
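To see how much of that extent is down to a handful of rogue points, I’ve been comparing raw min/max against percentile bounds on the filtered cloud, along these lines (a sketch assuming the cloud is the PLY from the filterpoints stage and using the plyfile package; nothing here is an ODM tool):

```python
import numpy as np
from plyfile import PlyData

# Compare raw min/max extents against robust (percentile) extents for the
# filtered point cloud - a big gap means a few outliers are inflating the
# bounds that the DEM/ortho steps are sized from.
ply = PlyData.read("odm_filterpoints/point_cloud.ply")
v = ply["vertex"]

for name in ("x", "y", "z"):
    axis = np.asarray(v[name], dtype=np.float64)
    lo, hi = axis.min(), axis.max()
    p_lo, p_hi = np.percentile(axis, [0.5, 99.5])
    print(f"{name}: raw {lo:12.1f} .. {hi:12.1f} | "
          f"0.5-99.5 pct {p_lo:10.1f} .. {p_hi:10.1f}")
```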

Has anyone successfully used RGN data before (and if so any particular settings needed)?

Abnormally large bounds would explain heavy memory usage.