I’ll see how the split goes. It is fast but is eating into my swap space (too many concurrent jobs; I’m trying to free disk space in the VM to add some more!), but that looks like the way to go.
Re-reading the PDAL response (and without diving too far into their code), it seems that they hold the entire point cloud in memory for the calculation and only stream the writer. Also, they use “double” for the calculations, i.e. I need at least 18 * 8 = 144 GB of RAM for that alone, and that has to be physical, not virtual.
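As a rough back-of-the-envelope check (this is just my arithmetic above, assuming one 8-byte double per cell, not PDAL’s actual memory layout):

```python
# Back-of-envelope only; the real PDAL memory layout will differ.
cells = 18e9            # ~18 billion cells/points in my dataset
bytes_per_double = 8
print(f"~{cells * bytes_per_double / 1e9:.0f} GB")   # ~144 GB
```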
PDAL is designed with the philosophy that if you need to process more data, you split up your dataset. It keeps threading code out of the project and probably also keeps memory management simple.
I would say that if you’re still running out of memory because you’re running all of the split portions simultaneously, limit the number of parallel threads and queue up the rest.
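Something along these lines could do it (a hypothetical sketch, not ODM’s actual split-merge code; process_submodel and the submodel names are placeholders):

```python
# Hypothetical sketch: cap how many split portions run at once, queue the rest.
from concurrent.futures import ProcessPoolExecutor

def process_submodel(name):
    ...  # run the reconstruction for one split portion

submodels = ["submodel_0000", "submodel_0001", "submodel_0002", "submodel_0003"]

# max_workers limits the number running simultaneously; the others wait.
with ProcessPoolExecutor(max_workers=2) as pool:
    list(pool.map(process_submodel, submodels))
```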
This is fun to follow. Keep up the interesting work!
I am just a little bemused as to why PDAL, when requiring huge amounts of memory, is tied to physical RAM rather than virtual memory. Unix has had virtual memory for its 50-year life for exactly these situations, so not allowing users to utilise it when it meets their personal requirements seems strange.
Anyway, in this run (my first split/merge) I underestimated the memory for each SfM reconstruction thread (at about 15 GB), hence I have too many threads running, but at least it is running as expected. And while SfM hasn’t completed, it is quicker so far (even having to page out, which did stabilise at only 50 GB). I also expect the dense step to have issues as noted earlier, but I can always re-run that with fewer threads if it crashes.
I’ll let you know how it goes as it slowly works through.
Separate question: I could not find whether split-merge uses the GCP file (the docs said it wasn’t needed and there is no command-line parameter for it, but they alluded to it still being usable). I put it in ‘opensfm’ but am not sure whether it will be used. Is it used in split-merge? A 5 m geo error wouldn’t really work (and I spent a fair bit of cash on a Trimble Catalyst system to map the property to 30 cm).
Regarding the use of GCPs, I’m not certain. But the docs are unclear where they say it isn’t needed: what they should say is that GCPs aren’t needed to keep the different submodels aligned, since the alignment is handled through matching and SfM.
I believe a properly named file in the correct location should be picked up by OpenSfM, but @dkbenjamin would be better equipped to answer this.
There were two issues in PDAL (GDAL Utils). The first, #2448, was included in the 1.9 release (and was unsuccessful); the other I have just raised (an overflow in an integer multiplication in the same function). I have tested it OK manually with a modified fork of PDAL 1.9 (not sure of the output quality yet). I am still re-testing it within the ODM pipeline (I have modified it to load from my PDAL fork), but it is looking good. I have also requested that the fix be included in PDAL 1.8-maintenance, as PDAL 1.9 requires GDAL 2.2+, which is only in ubuntugis-unstable (changing to which may have other unintended effects, especially for those using QGIS etc.).
This ‘fault’ would only have affected files approaching ~18 Gpx or larger (mine barely tripped the overflow at 17.9 Gpx, or 85k x 56k).
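As a quick arithmetic check (illustration only; I’m not claiming this is the exact integer type PDAL uses):

```python
# Quick check: a width x height pixel count at this scale no longer fits
# in a 32-bit integer (illustration only, not PDAL's actual code).
width, height = 85_000, 56_000
pixels = width * height
print(pixels > 2**31 - 1)   # True: exceeds a signed 32-bit int
print(pixels > 2**32 - 1)   # True: exceeds an unsigned 32-bit int too
```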
I’ll post a further update when the PDAL actions are completed (for future thread completeness) and raise the associated pull request(s).
I now have a segfault in dem2points, having successfully got through PDAL and gap-filling.
Looking quickly at the code, I suspect it is almost an identical integer overflow to the one in PDAL (though the backtrace is less complete). In PDAL, the same integer product (pixel width x height) overflowed at this size. More work is needed to confirm, but I suspect we need to cast arr_width to long at line 165 to avoid the overflow?
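The pattern I have in mind, mimicked with numpy fixed-width integers since the real dem2points code is C++ (promote one operand to 64 bits before the multiply, which is what casting arr_width to long would do):

```python
# Illustration of the suspected overflow and the fix, using numpy to mimic
# fixed-width integer arithmetic (the real dem2points code is C++).
import numpy as np

arr_width, arr_height = 85_000, 56_000

# Product computed in a 32-bit type wraps around (numpy may warn here):
wrapped = np.multiply(arr_width, arr_height, dtype=np.int32)

# Promoting to a 64-bit type first, i.e. the effect of casting to long:
promoted = np.multiply(arr_width, arr_height, dtype=np.int64)

print(wrapped, promoted)
```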
Looks like a neat way to reduce the memory load in the longer term.
I will look at the current dem2points to see whether I can overcome the segfault now (and get my own job completed!). That way it will hopefully work on larger sets in the short term (given enough memory). The final PDAL fix was neater (actually just more professional) than my original, so I’ll try to base the change on that, and raise a pull request if it appears OK.
With PDAL fixed, I was able to run it manually in both streamed and non-streamed modes. Both used the same RAM (72 GB) for the 18 Gpx file. I didn’t properly time it, so I’m not sure of the speed difference (but it was only around 20-30 min anyway, in a pipeline where OpenSfM takes around 60 hrs!).
With the PDAL overflow issues, the options for incorporating the fixes need to be looked at. Given that ODM would need to move to GDAL 2.2+ for PDAL 1.9+, and that for 16.04 this currently means using ubuntugis-unstable (with possible/probable side effects), I’d suggest there are probably two main ways ahead:
1. Update ODM to use the PDAL 1.8-maintenance branch. I have asked PDAL to include the fix in that branch. Nice and neat, but it would theoretically insert an unstable branch into ODM. I don’t think the maintenance branches are overly volatile (I suspect most change goes into the main repo and only critical fixes go in there), but there is a risk.
2. Fork PDAL 1.8.0 into OpenDroneMap, add the two changes, and use the fork for ODM. Keeps control but adds overhead.
Happy to raise an issue on this one if you would like.
Your changes to dem2points appear to work (did you pull my mods before doing the data handling, or just come to the same conclusion about the fix?). My pipeline has just passed meshing as we speak.
The PDAL bug is fixed via issue #2456. I forked PDAL 1.8-Maintenance, applied the change (#2456 includes #2450; one file, GDALUtils.hpp, original commit 93fd02) and used my PDAL fork in ODM as the only change (it also used the stable GDAL 2.1.2), and it was OK. I think that might be the way to go in the short term (i.e. option #2 above).
I am not sure that GDAL 2.2+ will necessarily move to 16.04 stable in a hurry. GDAL for 18.04+ is pretty up to date in the main Ubuntu repo (2.2+), and nothing has been done in ubuntugis stable (for GDAL) since 2017. So I think a PDAL 1.8-maintenance fork patched with #2456 is probably the way to go.
Well, although the pipeline ran without interruption, the output was not correct. It appears that the PDAL output is correct (it looks like a sparse point file across the whole area!) but the output from the gap filler is not (and neither are the other subsequent 2.5D outputs). Looking at mesh_dsf.tif (an image rendered with hillshading for reference) shows that only the first 10% or so seems to have been processed (and that part looks correct). That would be roughly 2 GB (indicating another 32-bit integer issue somewhere), but other smaller image sets I have run would have been over that as well, and I cannot find anything in gippy/numpy that would indicate any 32-bit issues with the arrays.
I’ll try to play around with the code to output some more “intermediate” info, but I would appreciate any pointers to narrow down the problem.
It’s possible that something is overflowing when we do gap-filling with numpy / gippy. Nothing specific comes to mind.
I would recommend trying the mvs_and_smvs branch (https://github.com/OpenDroneMap/ODM/pull/921). We have just changed the gap-filling method in that branch, along with tiled DEM generation, so perhaps you’ll be able to get further?
Yes, the median filtering using scipy.signal.medfilt() in opendm/dem/commands.py has an underlying C routine that uses int_p to index the array, which in this case is just over the 32-bit length since it is width*length. I replaced that with scipy.ndimage.median_filter(), which I assume is what was alluded to by the comment “There’s another numpy function that takes care of these edge cases, but it is slower”; I think it is all Python (I didn’t dig too deep), and the output was OK in my test code. I am now re-running the pipeline from meshing (again) with the replaced function!
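For reference, this is the swap (a sketch only; the kernel size here is just an example, and I haven’t checked that the edge handling of the two functions is identical):

```python
# The replacement I made, sketched on a small array (kernel size is an example).
import numpy as np
from scipy import ndimage, signal

dem = np.random.rand(1024, 1024).astype(np.float32)

# Old call: the underlying C routine's array indexing is what overflows
# once width*length passes the 32-bit range.
old = signal.medfilt(dem, kernel_size=5)

# New call: mode='constant', cval=0 roughly mirrors medfilt's zero padding
# at the edges (I haven't verified the outputs are identical).
new = ndimage.median_filter(dem, size=5, mode='constant', cval=0.0)
```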
I’ll have a look at the new branch once I succeed with the current run (or fail and give up!) and give it what appears to be a real stress test with this data set.
This is great to know. If medfilt indeed has a size limit, we should check the size of the image and switch to the slower algorithm when the limit is exceeded.
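Something like this guard is what I have in mind (a sketch; the 2**31 element threshold is an assumption about where medfilt’s indexing breaks, not a confirmed limit, and safe_median_filter is just an illustrative name):

```python
# Sketch of the size check; the 2**31 threshold is an assumed limit.
import numpy as np
from scipy import ndimage, signal

def safe_median_filter(dem, kernel_size=5):
    if dem.size < 2**31:
        # Fast path for arrays the medfilt C routine can index safely.
        return signal.medfilt(dem, kernel_size=kernel_size)
    # Slower path, but reportedly fine on arrays past the 32-bit index range.
    return ndimage.median_filter(dem, size=kernel_size,
                                 mode='constant', cval=0.0)

out = safe_median_filter(np.zeros((1000, 1000), dtype=np.float32))
```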
Now I have issues in either dem2points or PoissonRecon; I suspect it is the latter. I can’t read dsm_points.ply as it is too big, but the number of vertices (568M) sounds about right, whereas the PoissonRecon output is only 27K vertices. More work needed, but progressing.
Success. I have a good ortho from the dataset. PoissonRecon 11.1 (the current version as of yesterday) has a BIG_DATA flag in preprocessor.h that, when set, makes it work on big data files, so I also needed to fork/modify that. To implement my changes you would need to fork both PDAL and PoissonRecon and make the small changes to each. If you think it is worth it (noting your other work), I’ll raise an issue. Otherwise they are in my forks (mikethefarmer).
I might raise a separate thread on here to give my context and what I did, in case anyone else comes looking (especially if you don’t see a need to change main ODM because upcoming work will overtake it). It will also give some context for the “how many images, what size, how big a machine, etc.” questions that come up regularly (and that I also had).
Thanks for the assistance on this. I will try to find time to look at both split-merge (which failed on my set) and the mvs_and_smvs branch once I get a bit more farm work done (including taking a new set of photos!).