Error in ODM meshing cell - pdal pipeline


#1

Hi All,

I was processing images using ODM from a native install. I am using the latest version pulled from GitHub. I ran into this error at the ODM_meshing stage:

[INFO]    Running ODM Meshing Cell
[DEBUG]   Writing ODM Mesh file in: /datasets/code/odm_meshing/odm_mesh.ply
[DEBUG]   running /code/SuperBuild/src/PoissonRecon/Bin/Linux/PoissonRecon --in /datasets/code/smvs/smvs_dense_point_cloud.ply --out /datasets/code/odm_meshing/odm_mesh.dirty.ply --depth 12 --pointWeight 0.0 --samplesPerNode 1.0 --threads 12 --linearFit
[DEBUG]   running /code/build/bin/odm_cleanmesh -inputFile /datasets/code/odm_meshing/odm_mesh.dirty.ply -outputFile /datasets/code/odm_meshing/odm_mesh.ply -removeIslands -decimateMesh 1000000
[DEBUG]   Writing ODM 2.5D Mesh file in: /datasets/code/odm_meshing/odm_25dmesh.ply
[DEBUG]   ODM 2.5D DSM resolution: 0.1028
[INFO]    Created temporary directory: /datasets/code/odm_meshing/tmp
[INFO]    Creating DSM for 2.5D mesh
[INFO]    Creating ../datasets/code/odm_meshing/tmp/mesh_dsm_r0.145381154212 [max] from 1 files
[DEBUG]   running pdal pipeline -i /tmp/tmpFIfj5V.json > /dev/null 2>&1
Traceback (most recent call last):
  File "/code/run.py", line 47, in <module>
    plasm.execute(niter=1)
  File "/code/scripts/odm_meshing.py", line 108, in process
    method='poisson' if args.fast_orthophoto else 'gridded')
  File "/code/opendm/mesh.py", line 35, in create_25dmesh
    max_workers=max_workers
  File "/code/opendm/dem/commands.py", line 38, in create_dems
    fouts = list(e.map(create_dem_for_radius, radius))
  File "/usr/local/lib/python2.7/dist-packages/loky/process_executor.py", line 786, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/usr/local/lib/python2.7/dist-packages/loky/_base.py", line 589, in result_iterator
    yield future.result()
  File "/usr/local/lib/python2.7/dist-packages/loky/_base.py", line 433, in result
    return self.__get_result()
  File "/usr/local/lib/python2.7/dist-packages/loky/_base.py", line 381, in __get_result
    raise self._exception
Exception: Child returned 139

This was caused directly by 
"""
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/loky/process_executor.py", line 410, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python2.7/dist-packages/loky/process_executor.py", line 329, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/code/opendm/dem/commands.py", line 92, in create_dem
    pdal.run_pipeline(json, verbose=verbose)
  File "/code/opendm/dem/pdal.py", line 232, in run_pipeline
    out = system.run(' '.join(cmd) + ' > /dev/null 2>&1')
  File "/code/opendm/system.py", line 34, in run
    raise Exception("Child returned {}".format(retcode))
Exception: Child returned 139
"""

I also tried the Docker image (latest), which gave the same error as above.

The command I gave was:

docker run -it --rm -v "/home/garlac/ODM_dev_version/ODMprojects/Docker_projects/MLA_3nov_v1:/datasets/code" opendronemap/opendronemap --mesh-point-weight 0 --force-ccd 12.8333 --texturing-nadir-weight 32 --smvs-output-scale 1 --ignore-gsd --orthophoto-resolution 2.57 --mesh-size 1000000 --time --mesh-octree-depth 12 --opensfm-depthmap-method "BRUTE_FORCE" --crop 0 --resize-to -1 --project-path '/datasets'

Earlier I didn't get such an error when the --use-3dmesh parameter was used!

Is it a bug, or some other problem?

Thank you.


#2

Hey @garlac :hand: Does it fail with every dataset or just certain ones?


#3

I am running it on two other datasets; I will let you know once they complete.


#4

Hi @pierotofy,

It is failing only on certain datasets; I am not sure of the reason.

The images of one such dataset where it failed are shared at the link below:
https://drive.google.com/open?id=14opb07XR643N5UNJiF5x5Be1qsHqp3Mx


#5

I see no resolution to this one here, so I'll add to it.

I have just got a very similar SEGFAULT when running the pdal pipeline. I am processing a very large set (3924 images), ortho only, through WebODM. I am running natively on Ubuntu 16.04.5 with 144 GB RAM and 250 GB swap. The console output was:

[INFO] Creating DSM for 2.5D mesh
[INFO] Creating …/www/data/f5712d28-ab02-4e19-88dd-dcf80c8cd3df/odm_meshing/tmp/mesh_dsm_r0.0707106781187 [max] from 1 files
[DEBUG] running pdal pipeline -i /tmp/tmpRvQULq.json > /dev/null 2>&1
Traceback (most recent call last):
  File "/code/run.py", line 47, in <module>
    plasm.execute(niter=1)
  File "/code/scripts/odm_meshing.py", line 108, in process
    method='poisson' if args.fast_orthophoto else 'gridded')
  File "/code/opendm/mesh.py", line 36, in create_25dmesh
    max_workers=get_max_concurrency_for_dem(available_cores, inPointCloud)
  File "/code/opendm/dem/commands.py", line 34, in create_dems
    fouts = list(e.map(create_dem_for_radius, radius))
  File "/usr/local/lib/python2.7/dist-packages/loky/process_executor.py", line 794, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/usr/local/lib/python2.7/dist-packages/loky/_base.py", line 589, in result_iterator
    yield future.result()
  File "/usr/local/lib/python2.7/dist-packages/loky/_base.py", line 433, in result
    return self.__get_result()
  File "/usr/local/lib/python2.7/dist-packages/loky/_base.py", line 381, in __get_result
    raise self._exception
Exception: Child returned 139

This was caused directly by
"""
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/loky/process_executor.py", line 418, in _process_worker
    r = call_item()
  File "/usr/local/lib/python2.7/dist-packages/loky/process_executor.py", line 272, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python2.7/dist-packages/loky/process_executor.py", line 337, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/code/opendm/dem/commands.py", line 83, in create_dem
    pdal.run_pipeline(json, verbose=verbose)
  File "/code/opendm/dem/pdal.py", line 163, in run_pipeline
    out = system.run(' '.join(cmd) + ' > /dev/null 2>&1')
  File "/code/opendm/system.py", line 34, in run
    raise Exception("Child returned {}".format(retcode))
Exception: Child returned 139
"""

As you can see, it is pretty much the same as above. Syslog indicated that the actual SEGFAULT occurred in pdal:

Mar 19 01:28:37 webodm-1 kernel: [203542.649344] pdal[15359]: segfault at 7efd08f61268 ip 00007f1309e49778 sp 00007ffc12f49300 error 4 in libpdal_base.so.6.1.0[7f1309be2000+49e000]

I have rerun this a couple of times with differing settings (e.g. max concurrency = 1, --use-3dmesh) with no change. I have also re-run the whole pipeline twice (as it is auto-cleared after 2 days), which is painful (OpenSfM takes > 50 hrs on this dataset), with the same outcome.

I have been logging memory (just outputting 'free') and at the time of the crash memory was only at 50% of RAM (no swap used). I am certainly not a programmer (just a retired engineer, so I am not sure I am reading this correctly), but on the last run from meshing in WebODM (as it crashed within minutes) I watched the 'top' output. pdal appeared to allocate about 72 GB of memory (in the process line, initially 36 then 72), then proceed to use it up quite quickly (based on the 'free' line) without requesting any more, and crash. I rebooted and re-ran to see whether there was an issue releasing cache/buffer memory, but it still crashed with about 68 GB of RAM completely free. Is there either a hard memory limit (either via coding or compiling) or a memory allocation bug in libpdal_base?

I have previously run a subset of these photos (around 2400 from memory) with no issues. While I could probably divide the set into two, I would prefer to process it as a whole, so I am looking for a solution.


#6

Thanks for sharing this information! I've actually seen this error more than a few times but haven't been able to pinpoint its exact cause.

It would be interesting to see if tweaking the file:

/tmp/tmpRvQULq.json

leads to any improvements, and then re-running:

pdal pipeline -i /tmp/tmpRvQULq.json
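
For reference, that temporary file is a regular PDAL pipeline JSON, so you can copy it somewhere safe and edit it before re-running. It should look roughly like the sketch below (the filenames and numbers are placeholders, not the ones from your run, and the exact stages ODM writes may differ slightly); the knobs worth trying are the resolution and radius options of the GDAL writer:

{
  "pipeline": [
    {
      "type": "readers.ply",
      "filename": "/path/to/dense_point_cloud.ply"
    },
    {
      "type": "writers.gdal",
      "filename": "mesh_dsm.max.tif",
      "output_type": "max",
      "resolution": 0.1,
      "radius": 0.14,
      "gdaldriver": "GTiff"
    }
  ]
}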

P.S. For monitoring memory, check out mprof run and mprof plot: https://unix.stackexchange.com/questions/554/how-to-monitor-cpu-memory-usage-of-a-single-process
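
For example (assuming memory_profiler is installed, e.g. via pip install memory_profiler; mprof run samples the memory usage of the command while it runs, and mprof plot graphs the recording):

mprof run pdal pipeline -i /tmp/tmpRvQULq.json
mprof plot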


#7

Thanks Piero.

I am not sure how best to "tweak" the JSON file, as the only parameters are resolution and radius (I know where the resolution comes from, but not how the radius is derived in the WebODM pipeline or what I should try setting it to). I have been down the rabbit hole of PDAL, then points2grid (where this apparently originated), then GDAL. I struggled to find the core code that PDAL was running, but I think it is in GDAL (gdal_grid?).

Interestingly, the original points2grid had a memory setting and a max size setting (at compile time) to determine whether the gridding is done in-core or out-of-core depending on the size. They were obviously expecting very large datasets. The behaviour of PDAL suggests there is a limit, but I can't get my head around the GDAL code. I am not sure whether it needs more cache or an updated GDAL (meaning I'll need to go unstable; I'm at 2.1.3 with 16.04.5). The apparent memory limit when running this is around 50%, but the cache defaults to 5%. It is not multithreaded as far as I can see, so I am stumped.
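
One thing I might try (untested, and I am not sure the gridding itself even goes through the GDAL block cache) is raising GDAL's cache limit before launching the pipeline, e.g.:

GDAL_CACHEMAX=16384 pdal pipeline -i /tmp/tmpRvQULq.json

(a plain number is read as megabytes, so that would be 16 GB).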

Is this worth raising with the PDAL project?

Attached is the mprof output.


#8

Yes, PDAL uses the gdal_grid API for creating the raster, so that's a likely bottleneck: https://github.com/PDAL/PDAL/blob/master/io/GDALWriter.cpp#L239

I guess it would be interesting to see how radius affects the memory usage.
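
If your PDAL build supports command-line option substitution (an assumption about your version; otherwise just edit the value in the JSON), you should be able to vary it without touching the file, something like:

pdal pipeline -i /tmp/tmpRvQULq.json --writers.gdal.radius=0.35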


#9

Thanks for that.

Having tried a couple of radii (0.6 and 0.35; the latter is res*sqrt(2), which was a recommended radius) with no change, I noticed that the output file appears to have been written prior to the segfault. The file was about 8.6 GB (85K x 56K pixels, which would translate to around 2.1 km x 1.4 km, about right) and iotop did show a long write. That makes me more confused about the source of the problem. If it was successfully written, that would take us to about line 310 (of GDALWriter.cpp); no idea where it goes from there if it completes doneFile().

What would be the next part of the script to run manually to check whether the output is OK? (I get totally lost in your Python scripts!)


#10

The output file would be stored in www/data/<TASKID>/odm_meshing/tmp/.
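
A quick way to sanity-check it (assuming GDAL's command-line tools are installed; the exact filename will differ, and the .tif extension is my guess) would be:

gdalinfo -stats www/data/<TASKID>/odm_meshing/tmp/mesh_dsm_r0.0707106781187.max.tif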

The stack trace shows you the line of the next command (/code/opendm/dem/pdal.py, line 164).


#11

I further tested with the resolution set to 3 instead of the original 2.5. That seemed to run OK, with peak memory around 50 GB. I need to do a server rebuild anyway, so I'll rebuild, then run the entire pipeline with the slightly lower resolution and see whether that actually avoids the problem. That will be a few days away.