Multiple ODM containers running simultaneously crash while stitching


#1

I am running into an error attempting to stitch multiple ortho-mosaics at the same time in Docker via ODM’s CLI. I am using the version 0.4 image of opendronemap/opendronemap and creating 5 containers to create 5 simultaneous stitches.

Hardware:
24 CPU machine with 128 GB of RAM running Windows 10 Pro.

Test:
Run multiple ODM containers in parallel in Docker. I have allocated 20 CPUs and 120 GB of memory to Docker. I am running 5 ODM containers in parallel using opendronemap/opendronemap version 0.4. All commands below are executed at about the same time, within a second of each other. The dataset being stitched contains 80 images.

Commands:
docker run -ti --rm -v c:/project/OpenDroneMap/odm_input1:/datasets/code opendronemap/opendronemap --project-path /datasets --fast-orthophoto --orthophoto-compression LZW

docker run -ti --rm -v c:/project/OpenDroneMap/odm_input2:/datasets/code opendronemap/opendronemap --project-path /datasets --fast-orthophoto --orthophoto-compression LZW

docker run -ti --rm -v c:/project/OpenDroneMap/odm_input3:/datasets/code opendronemap/opendronemap --project-path /datasets --fast-orthophoto --orthophoto-compression LZW

docker run -ti --rm -v c:/project/OpenDroneMap/odm_input4:/datasets/code opendronemap/opendronemap --project-path /datasets --fast-orthophoto --orthophoto-compression LZW

docker run -ti --rm -v c:/project/OpenDroneMap/odm_input5:/datasets/code opendronemap/opendronemap --project-path /datasets --fast-orthophoto --orthophoto-compression LZW

Results:
On rare occasions more than 1 will succeed, but most of the time 4 fail and 1 succeeds. The ones that fail always error with the same stack trace, shown below.

Error:

[INFO] Running ODM Meshing Cell
[DEBUG] Writing ODM 2.5D Mesh file in: /datasets/code/odm_meshing/odm_25dmesh.ply
[INFO] Created temporary directory: /datasets/code/odm_meshing/tmp
[INFO] Creating DSM for 2.5D mesh
[INFO] Creating …/datasets/code/odm_meshing/tmp/mesh_dsm_r0.282842712475 [idw] from 1 files
[DEBUG] running pdal pipeline -i /tmp/tmpA7ngCA.json > /dev/null 2>&1
Traceback (most recent call last):
File "/code/run.py", line 47, in <module>
plasm.execute(niter=1)
File "/code/scripts/odm_meshing.py", line 98, in process
max_workers=args.max_concurrency)
File "/code/opendm/mesh.py", line 35, in create_25dmesh
max_workers=max_workers
File "/code/opendm/dem/commands.py", line 38, in create_dems
fouts = list(e.map(create_dem_for_radius, radius))
File "/usr/local/lib/python2.7/dist-packages/loky/process_executor.py", line 788, in _chain_from_iterable_of_lists
for element in iterable:
File "/usr/local/lib/python2.7/dist-packages/loky/_base.py", line 589, in result_iterator
yield future.result()
File "/usr/local/lib/python2.7/dist-packages/loky/_base.py", line 433, in result
return self.__get_result()
File "/usr/local/lib/python2.7/dist-packages/loky/_base.py", line 381, in __get_result
raise self._exception
Exception: Child returned 1

Any help to resolve this error would be greatly appreciated.


#2

Hey @Joe_Tinguely, it’s possible that you might be running out of memory, depending on how many images each of your tasks has (that would explain why sometimes they complete and sometimes they do not).

Do they work if you run only 2 or 3 tasks simultaneously instead of 5?


#3

I ran a test with only 2 tasks running simultaneously, and both tasks failed with the same error.

[INFO] Running ODM OpenSfM Cell - Finished
[INFO] Running ODM Meshing Cell
[DEBUG] Writing ODM 2.5D Mesh file in: /datasets/code/odm_meshing/odm_25dmesh.ply
[WARNING] Cannot calculate GSD, using requested resolution of 5.0
[INFO] Created temporary directory: /datasets/code/odm_meshing/tmp
[INFO] Creating DSM for 2.5D mesh
[INFO] Creating …/datasets/code/odm_meshing/tmp/mesh_dsm_r0.282842712475 [idw] from 1 files
[DEBUG] running pdal pipeline -i /tmp/tmpy4JIWA.json > /dev/null 2>&1
Traceback (most recent call last):
File "/code/run.py", line 47, in <module>
plasm.execute(niter=1)
File "/code/scripts/odm_meshing.py", line 98, in process
max_workers=args.max_concurrency)
File "/code/opendm/mesh.py", line 35, in create_25dmesh
max_workers=max_workers
File "/code/opendm/dem/commands.py", line 38, in create_dems
fouts = list(e.map(create_dem_for_radius, radius))
File "/usr/local/lib/python2.7/dist-packages/loky/process_executor.py", line 788, in _chain_from_iterable_of_lists
for element in iterable:
File "/usr/local/lib/python2.7/dist-packages/loky/_base.py", line 589, in result_iterator
yield future.result()
File "/usr/local/lib/python2.7/dist-packages/loky/_base.py", line 433, in result
return self.__get_result()
File "/usr/local/lib/python2.7/dist-packages/loky/_base.py", line 381, in __get_result
raise self._exception
Exception: Child returned 1


#4

If you set --max-concurrency to a lower value (say, 4), does processing succeed?
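
For anyone wanting to repeat this test from a script, here is a rough sketch of a launcher that adds --max-concurrency to the command from the first post and, optionally, caps each container with Docker's standard --memory and --cpus flags. The function names are made up for illustration, the 24 GB / 4 CPU figures are simply 120 GB and 20 CPUs divided across 5 containers, and nothing in this thread confirms that per-container limits fix the crash. The -ti flags are left out on the assumption that a script has no terminal attached.

```python
import subprocess


def build_odm_command(input_dir, max_concurrency=4, memory_gb=None, cpus=None):
    """Build one `docker run` command list for a single ODM task.

    build_odm_command is a hypothetical helper, not part of ODM.
    """
    cmd = ["docker", "run", "--rm"]
    if memory_gb is not None:
        # Standard Docker flag: hard memory limit for this container.
        cmd += ["--memory", "%dg" % memory_gb]
    if cpus is not None:
        # Standard Docker flag: CPU share for this container.
        cmd += ["--cpus", str(cpus)]
    cmd += [
        "-v", "%s:/datasets/code" % input_dir,
        "opendronemap/opendronemap",
        "--project-path", "/datasets",
        "--fast-orthophoto",
        "--orthophoto-compression", "LZW",
        "--max-concurrency", str(max_concurrency),
    ]
    return cmd


def run_in_parallel(input_dirs, **kwargs):
    """Start one container per input directory and collect the exit codes."""
    procs = [subprocess.Popen(build_odm_command(d, **kwargs)) for d in input_dirs]
    return [p.wait() for p in procs]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch is safe to try.
    for i in range(1, 6):
        path = "c:/project/OpenDroneMap/odm_input%d" % i
        print(" ".join(build_odm_command(path, memory_gb=24, cpus=4)))
```

Calling run_in_parallel with the five input folders would start all containers at once and return their exit codes in order, which makes it easier to see which runs failed.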


#5

@pierotofy I set --max-concurrency to 4 and ran 3 containers in parallel; 1 succeeded and 2 failed.

The 2 that failed had the same stack trace but returned different codes:

Fail 1:
File "/usr/local/lib/python2.7/dist-packages/loky/_base.py", line 381, in __get_result
raise self._exception
Exception: Child returned 137

Fail 2:
File "/usr/local/lib/python2.7/dist-packages/loky/_base.py", line 381, in __get_result
raise self._exception
Exception: Child returned 1
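
A note on reading those two codes: by the usual shell convention, an exit status above 128 means the process was terminated by signal (status - 128), so 137 would correspond to signal 9, SIGKILL. That is what a process typically gets when the kernel's out-of-memory killer or a container memory limit steps in, which would fit the memory theory from earlier, although nothing here proves it. A status of 1 is just a generic failure. The sketch below assumes that "Child returned N" reports such a shell-style exit status; describe_exit_status is a hypothetical helper, not part of ODM or loky.

```python
import signal


def describe_exit_status(code):
    """Interpret a shell-style exit status (hypothetical helper)."""
    if code == 0:
        return "success"
    if code > 128:
        # Shell convention: 128 + N means "terminated by signal N".
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = "signal %d" % signum
        return "killed by %s" % name
    return "exited with error code %d" % code


if __name__ == "__main__":
    for status in (137, 1):
        print(status, "->", describe_exit_status(status))
```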


#6

@pierotofy I ran another test with --max-concurrency set to 3 and 3 containers in parallel; 2 succeeded and 1 failed.

Of the 2 processes that succeeded, only 1 produced a properly stitched orthophoto. Below is a section of each TIFF for comparison.

Stitched properly:
PNG

Stitched with problems:
PNG