Error in ODM meshing cell - pdal pipeline

Hi All,

I was processing images using ODM from a native install. I am using the latest version pulled from GitHub, and I ran into this error at the odm_meshing stage:

[INFO]    Running ODM Meshing Cell
[DEBUG]   Writing ODM Mesh file in: /datasets/code/odm_meshing/odm_mesh.ply
[DEBUG]   running /code/SuperBuild/src/PoissonRecon/Bin/Linux/PoissonRecon --in /datasets/code/smvs/smvs_dense_point_cloud.ply --out /datasets/code/odm_meshing/odm_mesh.dirty.ply --depth 12 --pointWeight 0.0 --samplesPerNode 1.0 --threads 12 --linearFit
[DEBUG]   running /code/build/bin/odm_cleanmesh -inputFile /datasets/code/odm_meshing/odm_mesh.dirty.ply -outputFile /datasets/code/odm_meshing/odm_mesh.ply -removeIslands -decimateMesh 1000000
[DEBUG]   Writing ODM 2.5D Mesh file in: /datasets/code/odm_meshing/odm_25dmesh.ply
[DEBUG]   ODM 2.5D DSM resolution: 0.1028
[INFO]    Created temporary directory: /datasets/code/odm_meshing/tmp
[INFO]    Creating DSM for 2.5D mesh
[INFO]    Creating ../datasets/code/odm_meshing/tmp/mesh_dsm_r0.145381154212 [max] from 1 files
[DEBUG]   running pdal pipeline -i /tmp/tmpFIfj5V.json > /dev/null 2>&1
Traceback (most recent call last):
  File "/code/run.py", line 47, in <module>
    plasm.execute(niter=1)
  File "/code/scripts/odm_meshing.py", line 108, in process
    method='poisson' if args.fast_orthophoto else 'gridded')
  File "/code/opendm/mesh.py", line 35, in create_25dmesh
    max_workers=max_workers
  File "/code/opendm/dem/commands.py", line 38, in create_dems
    fouts = list(e.map(create_dem_for_radius, radius))
  File "/usr/local/lib/python2.7/dist-packages/loky/process_executor.py", line 786, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/usr/local/lib/python2.7/dist-packages/loky/_base.py", line 589, in result_iterator
    yield future.result()
  File "/usr/local/lib/python2.7/dist-packages/loky/_base.py", line 433, in result
    return self.__get_result()
  File "/usr/local/lib/python2.7/dist-packages/loky/_base.py", line 381, in __get_result
    raise self._exception
Exception: Child returned 139

This was caused directly by 
"""
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/loky/process_executor.py", line 410, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python2.7/dist-packages/loky/process_executor.py", line 329, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/code/opendm/dem/commands.py", line 92, in create_dem
    pdal.run_pipeline(json, verbose=verbose)
  File "/code/opendm/dem/pdal.py", line 232, in run_pipeline
    out = system.run(' '.join(cmd) + ' > /dev/null 2>&1')
  File "/code/opendm/system.py", line 34, in run
    raise Exception("Child returned {}".format(retcode))
Exception: Child returned 139
"""

I also tried the Docker image (latest), which gave the same error as above.

The command I gave was:

docker run -it --rm -v "/home/garlac/ODM_dev_version/ODMprojects/Docker_projects/MLA_3nov_v1:/datasets/code" opendronemap/opendronemap --mesh-point-weight 0 --force-ccd 12.8333 --texturing-nadir-weight 32 --smvs-output-scale 1 --ignore-gsd --orthophoto-resolution 2.57 --mesh-size 1000000 --time --mesh-octree-depth 12 --opensfm-depthmap-method "BRUTE_FORCE" --crop 0 --resize-to -1 --project-path '/datasets'

Earlier I didn't get such an error when the --use-3dmesh parameter was used!

Is it a bug, or some other problem?

Thank you.

Hey @garlac :hand: Does it fail with every dataset or just certain ones?

I am running it on two other datasets; I will let you know once they complete.

Hi @pierotofy,

It is failing only on certain datasets; I'm not sure of the reason!

The images of one such dataset where it failed are shared at the link below:
https://drive.google.com/open?id=14opb07XR643N5UNJiF5x5Be1qsHqp3Mx

I see no resolution to this one here, so I'll add to it.

I have just got a very similar SEGFAULT when running the pdal pipeline. I am processing a very large set (3924 images), ortho only, through WebODM. I am running natively on Ubuntu 16.04.5 with 144 GB RAM and 250 GB swap. The console output was:

[INFO] Creating DSM for 2.5D mesh
[INFO] Creating …/www/data/f5712d28-ab02-4e19-88dd-dcf80c8cd3df/odm_meshing/tmp/mesh_dsm_r0.0707106781187 [max] from 1 files
[DEBUG] running pdal pipeline -i /tmp/tmpRvQULq.json > /dev/null 2>&1
Traceback (most recent call last):
  File "/code/run.py", line 47, in <module>
    plasm.execute(niter=1)
  File "/code/scripts/odm_meshing.py", line 108, in process
    method='poisson' if args.fast_orthophoto else 'gridded')
  File "/code/opendm/mesh.py", line 36, in create_25dmesh
    max_workers=get_max_concurrency_for_dem(available_cores, inPointCloud)
  File "/code/opendm/dem/commands.py", line 34, in create_dems
    fouts = list(e.map(create_dem_for_radius, radius))
  File "/usr/local/lib/python2.7/dist-packages/loky/process_executor.py", line 794, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/usr/local/lib/python2.7/dist-packages/loky/_base.py", line 589, in result_iterator
    yield future.result()
  File "/usr/local/lib/python2.7/dist-packages/loky/_base.py", line 433, in result
    return self.__get_result()
  File "/usr/local/lib/python2.7/dist-packages/loky/_base.py", line 381, in __get_result
    raise self._exception
Exception: Child returned 139

This was caused directly by
"""
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/loky/process_executor.py", line 418, in _process_worker
    r = call_item()
  File "/usr/local/lib/python2.7/dist-packages/loky/process_executor.py", line 272, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python2.7/dist-packages/loky/process_executor.py", line 337, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/code/opendm/dem/commands.py", line 83, in create_dem
    pdal.run_pipeline(json, verbose=verbose)
  File "/code/opendm/dem/pdal.py", line 163, in run_pipeline
    out = system.run(' '.join(cmd) + ' > /dev/null 2>&1')
  File "/code/opendm/system.py", line 34, in run
    raise Exception("Child returned {}".format(retcode))
Exception: Child returned 139
"""

As you can see, it's pretty much the same as above. Syslog indicated that the actual SEGFAULT occurred in pdal:

Mar 19 01:28:37 webodm-1 kernel: [203542.649344] pdal[15359]: segfault at 7efd08f61268 ip 00007f1309e49778 sp 00007ffc12f49300 error 4 in libpdal_base.so.6.1.0[7f1309be2000+49e000]

I have rerun this a couple of times with differing settings (e.g. max concurrency = 1, --use-3dmesh) with no change. I have also re-run the whole pipeline twice (as it is auto-cleared after 2 days), which is painful (opensfm takes > 50 hrs on this dataset), with the same outcome.

I have been memory logging (just outputting 'free'), and at the time of the crash memory usage was only at about 50% of RAM (no swap used). I am certainly not a programmer (just a retired engineer, so I am not sure I am reading this correctly), but on the last run I started meshing from WebODM (as it crashed within minutes) and watched the 'top' output. pdal appeared to allocate about 72 GB of memory (in the process line - initially 36, then 72), proceed to use it up quite quickly (based on the 'free' line), but not request any more, and then crash. I did reboot and re-run to see whether there was an issue releasing cache/buffer memory, but it still crashed with about 68 GB of RAM completely free. Is there either a hard memory limit (set in code or at compile time) or a memory allocation bug in libpdal_base?

I have previously run a subset of these photos (around 2400, from memory) with no issues. While I can probably divide the set into two, I would prefer to process it as a whole, so I'm looking for a solution.

Thanks for sharing this information! I’ve actually seen this error more than a few times but haven’t been able to pin-point its exact cause.

It would be interesting to see if tweaking the file:

/tmp/tmpRvQULq.json

leads to any improvements, and then re-running:

pdal pipeline -i /tmp/tmpRvQULq.json
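
For reference, here is a minimal sketch of how one might load, tweak, and re-run that pipeline from Python. It assumes the temporary JSON follows the usual PDAL layout (an input reader followed by a writers.gdal stage); the exact keys ODM writes, and the example values below, are assumptions to adjust:

```python
import json
import subprocess

# Path to the temp pipeline ODM generated (from the log above).
pipeline_path = "/tmp/tmpRvQULq.json"

with open(pipeline_path) as f:
    pipeline = json.load(f)

# A typical writers.gdal stage exposes 'resolution' and 'radius'; the values
# below are placeholders -- the point is just to coarsen the grid and re-test.
for stage in pipeline.get("pipeline", []):
    if isinstance(stage, dict) and stage.get("type") == "writers.gdal":
        stage["resolution"] = 0.15
        stage["radius"] = 0.22  # roughly resolution * sqrt(2)

with open(pipeline_path, "w") as f:
    json.dump(pipeline, f, indent=2)

# Re-run without redirecting to /dev/null so errors stay visible.
subprocess.check_call(["pdal", "pipeline", "-i", pipeline_path])
```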

P.S. For monitoring memory, check out mprof run and mprof plot: https://unix.stackexchange.com/questions/554/how-to-monitor-cpu-memory-usage-of-a-single-process

Thanks Piero.

Not sure how best to "tweak" the JSON file, as the only parameters are resolution and radius (I know where the resolution comes from, but I'm not sure how the radius is derived in the WebODM pipeline or what value I should try). I have been down the rabbit hole of pdal, then points2grid (where this apparently originated), then gdal. I struggled to find the core code that pdal was running, but I think it is in gdal (gdal_grid?).

Interestingly, the original points2grid had a memory setting and a max size setting (at compile time) to decide whether gridding is done in-core or out-of-core depending on size; they were obviously expecting very large datasets. The behaviour of pdal suggests there is a limit, but I can't get my head around the gdal code. I'm not sure whether it needs more cache or an updated gdal (meaning I'll need to go unstable - I'm at 2.1.3 on 16.04.5). The apparent memory limit when running this is around 50%, but the cache defaults to 5%. It is not multithreaded as far as I can see, so I am stumped.
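
If the GDAL cache really is a factor, I suppose one cheap experiment (just my assumption that the gdal_grid path in writers.gdal honours the block cache here) would be to raise GDAL_CACHEMAX for the pdal process and see whether the behaviour changes, e.g.:

```python
import os
import subprocess

env = os.environ.copy()
# GDAL_CACHEMAX is read in megabytes; GDAL 2.x defaults to roughly 5% of RAM.
env["GDAL_CACHEMAX"] = "16384"  # e.g. 16 GB -- whatever the machine can spare

# Re-run the same pipeline ODM generated, with the larger cache.
subprocess.check_call(["pdal", "pipeline", "-i", "/tmp/tmpRvQULq.json"], env=env)
```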

Is this worth raising with PDAL?

Attached is the mprof output.

Yes, PDAL uses the gdal_grid API for creating the raster, so that's a likely bottleneck: https://github.com/PDAL/PDAL/blob/master/io/GDALWriter.cpp#L239

I guess it would be interesting to see how radius affects the memory usage.

Thanks for that.

Having tried a couple of radii (0.6 and 0.35 - the latter being res * sqrt(2), which was a recommended radius) with no change, I noticed that the output file appears to have been written prior to the segfault. The file was about 8.6 GB (85K x 56K pixels, which would translate to around 2.1 km x 1.4 km - about right), and iotop did show a long write. That makes me more confused about the source of the problem. If it was successfully written, that would take us to about line 310 - no idea where it goes from there if it completes doneFile().

What would be the next part of the script to run manually to check whether the output is OK? (I get totally lost in your Python scripts!)

The output file would be stored in www/data/<TASKID>/odm_meshing/tmp/.

The stack trace shows you the line of the next command (/code/opendm/dem/pdal.py, line 164).
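
If you want to sanity-check the raster that was written before the crash, a quick sketch with the GDAL Python bindings might help (the filename below is a placeholder; use the actual mesh_dsm_* file in odm_meshing/tmp/):

```python
from osgeo import gdal

# Placeholder path -- point this at the actual mesh_dsm_* file that was written.
path = "www/data/<TASKID>/odm_meshing/tmp/mesh_dsm.tif"

ds = gdal.Open(path)
if ds is None:
    raise SystemExit("GDAL could not open the raster -- likely truncated or corrupt")

band = ds.GetRasterBand(1)
print("size:", ds.RasterXSize, "x", ds.RasterYSize)
print("nodata:", band.GetNoDataValue())
print("stats (min, max, mean, stddev):", band.GetStatistics(True, True))
```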

I further tested with the resolution set to 3 instead of the original 2.5. That seemed to run OK, with peak memory around 50 GB. I need to do a server rebuild anyway, so I'll rebuild, then run the entire pipeline with the slightly lower resolution and see whether that actually avoids the problem. That will be a few days away.


I haven't done the rebuild yet, but I re-ran the full WebODM task with the resolution set to 3. After 62 hrs this also segfaulted at the same spot. Not sure why re-running just the pdal pipeline with res = 3 worked last time but not when the whole task was re-run.

I note that ODM uses PDAL 1.6 versus the current 1.8. There have been some changes to the GDAL writer in the interim, including #1824. I am not sure that was necessarily a 'real' problem (or just a failure under a robust testing regime), but we are using addPoint and we are segfaulting! I take it this change was included in the PDAL 1.7.1 release.

Are there any backward compatibility/regression issues that have kept PDAL from being upgraded in ODM to date (or did it just not raise its head above the radar as a potential issue requiring work while there are other more important things underway)?

As far as I know, there’s no reason other than time and priorities that PDAL isn’t updated. It’s pretty easy to update: just change SuperBuild/cmake/External-PDAL.cmake to point to a later version and test (and then share a pull request so others can test).

Pull requests are always welcome. Cheers!

Thanks. I went ahead and created a new install and have done the update to 1.8.0 - it appeared to compile OK, and I am 42 hrs into the run, so I should know by this time tomorrow whether it was successful.

Excellent! Can you share the pull request? We’ll start testing on our end with something smaller :slight_smile:

Cheers,
Steve

Steve,

Done the fork, confirmed the recompile (only some of the usual typedef-related warnings), and have completed a 600-image, 20 MP, full-size, ortho-only run OK. I am now running the same set with 'full' outputs. With the smaller image counts PDAL 1.6 was OK anyway.

I'll wait for the results of the two currently running jobs (a full 600 and an ortho-only 3924) before I raise a pull request. That way I'll have some idea whether the upgrade solved the PDAL SEGFAULT as well. I wouldn't think there's a rush anyway (since no one was looking at it to begin with), but if you want the pull request quickly for an upcoming release I will do it ASAP.

It should be done (hopefully) in 12 hrs or so, though I inadvertently shut down the VM running the big one via WebODM (native) - so I may have lost 12-36 hrs on that. The opensfm reconstruct appears to have resumed on reboot, but I haven't got an update in WebODM :frowning: If I need to restart from scratch it will be a further 3 days.

Aww, it's annoying when that happens. We can wait 3 days, that's fine. :slight_smile:

I ran the 600-picture 3D set through the pipeline, sending a dense point cloud file to pdal, at which point it immediately failed. Rather than the straight segfault (as with 1.6), it gives a better error output:

PDAL: writers.gdal: Can't shift existing grid outside of new grid during expansion.

I think this is related to stream mode, which is now the default if the file is streamable. Adding '--nostream' to pdal.py seemed to fix it. I will raise a pull request now for the changes to move to 1.8.0, and will continue to work on my large dataset to see whether it runs successfully with 1.8. That may take some time, as with the rain overnight I need to get a hay crop planted this week!
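
For anyone following along, this is roughly what the invocation boils down to with the flag added (names and paths here are illustrative, not the exact code in /code/opendm/dem/pdal.py):

```python
import subprocess

# Equivalent of the tweak in opendm/dem/pdal.py: pass --nostream so writers.gdal
# runs in non-streaming mode, which avoided the "Can't shift existing grid" error.
pipeline_json = "/tmp/tmpRvQULq.json"  # placeholder for the generated temp pipeline
subprocess.check_call(["pdal", "pipeline", "--nostream", "-i", pipeline_json])
```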

The default pdal stream mode may well help with large datasets. I will try to build a Python subprocess to extract the bounds ('pdal info') and try streaming. My Python is poor (and I still haven't got my head around ODM), so it will take me some time. It also takes quite a while to run 'pdal info' just to get the bounds for a large file (the dense point cloud from 600 pics took me 15 min)! But it may well save memory for large sets.
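
A rough sketch of what I have in mind for pulling the bounds out (this assumes pdal info's JSON output; the exact key layout under 'summary' seems to vary between PDAL releases, so inspect what comes back):

```python
import json
import subprocess

def point_cloud_bounds(path):
    """Native bounds of a point cloud via `pdal info --summary` (sketch)."""
    out = subprocess.check_output(["pdal", "info", "--summary", path])
    info = json.loads(out)
    # Key names differ between PDAL versions (flat minx/maxx vs nested X/Y/Z),
    # so just return the whole bounds object for inspection.
    return info["summary"]["bounds"]

print(point_cloud_bounds("smvs_dense_point_cloud.ply"))
```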

For info, they currently have an RC for PDAL 1.9, but it requires GDAL 2.2. The last GDAL for 16.04 (in the repo) is 2.1.3, so I suspect there will be a lot more work to move to 1.9. This may be more appropriate once a stable release for 18.04 is sorted. There was an indication of a segfault fixed in GDAL 2.2.3 that may be related to my issue here, but if that is the case it can wait (I'll split my dataset!!!).


Possibly the second-to-last update on this one: PDAL 1.8 did not solve the segfault issue, and neither did updating GDAL to 2.2.2 (using ubuntugis-unstable) and using PDAL 1.9 RC1.

I did try a manual streaming run but could not see any difference in processing (and ended with the same segfault). I raised the issue on the PDAL gitter but essentially got a "pay us to fix it" response - in all fairness with a bit more explanation of memory usage, though not enough to give me any pointers. So, as far as I can tell, there is an inbuilt coding fault/feature in PDAL: even with exponential growth I cannot fathom why a file 2/3 the size consumes around 50 GB of RAM (no-stream), yet for the full-size file 600 GB of virtual memory (144 GB RAM / 450 GB paging) is inadequate, even in alleged streaming mode! I suspect the actual PDAL processing (as opposed to the pipeline writing) is not streamed - assuming they calculate the data and then use streaming to write the bands (which in this case is only one band anyway).

Once the autumn tasks are done and I get some more time, I might try to delve a bit deeper into the PDAL code.

I have started a split-merge run to see whether this might help. Again, my Python is poor, but it seems that pdal is part of meshing(?), which is done on each submodel(?), so it might work around the pdal problem - eventually. As I ran the full script (to avoid a processing pause in the middle of the night), I may OOM due to too many concurrent processes in dense reconstruction (I am back to 128 GB RAM / 144 GB paging in this virtual machine), but at least I can restart manually from there if necessary (and possibly expand the machine).

Thanks for the support, guys. I have wasted more time than expected but am perversely enjoying it (can't get the engineer out of the farmer!). I hope I can contribute more moving forward.

It's awesome to hear about your research efforts! We wish we had more users like you :slight_smile:

I think the key to making the DEM generation less memory-heavy is simply to divide and conquer: split the area into chunks (with a bit of overlap) and then merge them.
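
Purely as an illustration of that idea (not something ODM does today, and the parameter values are placeholders): crop the cloud into overlapping tiles with filters.crop, rasterise each tile with writers.gdal, then mosaic the tiles with GDAL.

```python
import json
import subprocess
import tempfile

def dem_tile(in_cloud, out_tif, xmin, xmax, ymin, ymax,
             resolution=0.1, radius=0.15, overlap=5.0):
    """Rasterise one overlapping tile of a point cloud with PDAL (sketch)."""
    pipeline = {
        "pipeline": [
            in_cloud,
            {   # keep only the points in this tile (plus overlap) before gridding
                "type": "filters.crop",
                "bounds": "([{},{}],[{},{}])".format(
                    xmin - overlap, xmax + overlap, ymin - overlap, ymax + overlap),
            },
            {
                "type": "writers.gdal",
                "filename": out_tif,
                "resolution": resolution,
                "radius": radius,
                "output_type": "max",
            },
        ]
    }
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump(pipeline, f)
    subprocess.check_call(["pdal", "pipeline", "-i", f.name])

# Afterwards, mosaic the tiles with GDAL, e.g.:
#   gdalbuildvrt mosaic.vrt tile_*.tif && gdal_translate mosaic.vrt mesh_dsm.tif
```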

The other route is to write a more memory-efficient points-to-DEM writer, but that would be more time-consuming and complicated. Sometimes you just have to build your own tools (as we had to do with programs like dem2mesh; none of the libraries worked within the required runtime, output, or memory constraints).