Large data set woes


#1

I’m still working on the large data set (737 images) that I was stuck uploading before, but now I’m hitting a very odd memory issue:
I’m running the “Server Install” on Ubuntu 16.04 on AWS (r4.8xlarge: 32 cores and 244GB of memory, which should be much more than enough).
It keeps failing, but now I’ve captured the failures and maybe someone can help.
I’m running it with the “High Resolution” preset and no resizing of images (a machine with that much power should handle it without a problem).
It seems to be failing in OpenSfM with a complaint that there may not be enough RAM, although at the time of the crash less than 0.5% of the RAM was in use. See the attached screenshots.

In addition, here is the end of the output (I have the full log if needed):

2018-11-08 16:29:32,728 DEBUG: Robust matching time : 0.000403881072998s
2018-11-08 16:29:32,728 DEBUG: Full matching 1033 / 1055, time: 2.0970249176s
[DEBUG] running PYTHONPATH=/code/SuperBuild/install/lib/python2.7/dist-packages /code/SuperBuild/src/opensfm/bin/opensfm create_tracks /www/data/a8a6af23-8999-48b5-ad00-f96a33fdaab6/opensfm
2018-11-08 16:29:33,440 INFO: reading features
2018-11-08 16:30:06,190 DEBUG: Merging features onto tracks
2018-11-08 16:30:33,871 DEBUG: Good tracks: 797364
[DEBUG] running PYTHONPATH=/code/SuperBuild/install/lib/python2.7/dist-packages /code/SuperBuild/src/opensfm/bin/opensfm reconstruct /www/data/a8a6af23-8999-48b5-ad00-f96a33fdaab6/opensfm
2018-11-08 16:32:15,002 INFO: Starting incremental reconstruction
Traceback (most recent call last):
File "/code/SuperBuild/src/opensfm/bin/opensfm", line 34, in <module>
command.run(args)
File "/code/SuperBuild/src/opensfm/opensfm/commands/reconstruct.py", line 21, in run
report = reconstruction.incremental_reconstruction(data)
File "/code/SuperBuild/src/opensfm/opensfm/reconstruction.py", line 1161, in incremental_reconstruction
pairs = compute_image_pairs(common_tracks, data)
File "/code/SuperBuild/src/opensfm/opensfm/reconstruction.py", line 416, in compute_image_pairs
result = parallel_map(_compute_pair_reconstructability, args, processes)
File "/code/SuperBuild/src/opensfm/opensfm/context.py", line 38, in parallel_map
return list(e.map(func, args))
File "/usr/local/lib/python2.7/dist-packages/loky/process_executor.py", line 1069, in map
timeout=timeout)
File "/usr/local/lib/python2.7/dist-packages/loky/_base.py", line 581, in map
fs = [self.submit(fn, *args) for args in zip(*iterables)]
File "/usr/local/lib/python2.7/dist-packages/loky/reusable_executor.py", line 151, in submit
fn, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/loky/process_executor.py", line 1016, in submit
raise self._flags.broken
loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
Traceback (most recent call last):
File "/code/run.py", line 47, in <module>
plasm.execute(niter=1)
File "/code/scripts/run_opensfm.py", line 133, in process
(context.pyopencv_path, context.opensfm_path, tree.opensfm))
File "/code/opendm/system.py", line 34, in run
raise Exception("Child returned {}".format(retcode))
Exception: Child returned 1
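
For what it’s worth, this is roughly how memory usage could be logged alongside the run to back up the numbers in the screenshots (a minimal sketch using the psutil package, not part of ODM/OpenSfM; the interval and log file name are arbitrary):

import time
import psutil  # assumed to be installed separately (pip install psutil)

# Run this in a separate terminal alongside the task and stop it with Ctrl+C
# once the task finishes or crashes; the last lines show the memory usage
# right before the failure.
with open("memory_log.txt", "a") as log:
    while True:
        mem = psutil.virtual_memory()
        log.write("{} used={}% available={} MB\n".format(
            time.strftime("%Y-%m-%d %H:%M:%S"),
            mem.percent,
            mem.available // (1024 * 1024)))
        log.flush()
        time.sleep(1)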


#2

Have you tried setting the flag to reduce the number of cores you’re using? (Confession, I haven’t read your log yet…).


#3

I actually did, based on your previous suggestion, and lowered it to 16; it did not help. The question is, why would I need to reduce the number of cores in any case?


#4

More cores lead to more memory usage; fewer cores mean less memory usage.

244GB will certainly let you process 737 images. Can you confirm that you set --max-concurrency to a lower value?

If you start with a value of 1, OpenSfM will output a better error message; perhaps this is not a memory problem but something else.
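
For context, here is why a single process gives a clearer message. This is a minimal sketch of the general pattern behind this kind of parallel map (an illustration, not OpenSfM’s actual context.py): when work goes through a process pool and the OS kills a worker, the parent only sees a generic TerminatedWorkerError, whereas a serial fallback raises the original exception with its full traceback.

from concurrent.futures import ProcessPoolExecutor

def parallel_map(func, args, processes):
    # Sketch of the general pattern, not OpenSfM's implementation.
    if processes <= 1:
        # Serial path: any exception propagates with its original traceback,
        # which is why --max-concurrency 1 should give a more useful error.
        return list(map(func, args))
    with ProcessPoolExecutor(max_workers=processes) as executor:
        # If the OS kills a worker (segfault or OOM killer), the parent only
        # learns that the process died, hence the vague error in the logs above.
        return list(executor.map(func, args))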


#5

I lowered max_concurrency to 1 and it’s still the same problem: there is still plenty of free memory on the machine when it fails. Here is the end of the log from the latest run:

2018-11-09 13:34:56,672 DEBUG: Robust matching time : 0.000518083572388s
2018-11-09 13:34:56,672 DEBUG: Full matching 1006 / 1058, time: 2.08180403709s
[DEBUG] running PYTHONPATH=/code/SuperBuild/install/lib/python2.7/dist-packages /code/SuperBuild/src/opensfm/bin/opensfm create_tracks /www/data/838b255e-3e99-49e2-beab-5ae8bb0653f1/opensfm
2018-11-09 13:34:57,475 INFO: reading features
2018-11-09 13:35:30,357 DEBUG: Merging features onto tracks
2018-11-09 13:35:58,502 DEBUG: Good tracks: 798079
[DEBUG] running PYTHONPATH=/code/SuperBuild/install/lib/python2.7/dist-packages /code/SuperBuild/src/opensfm/bin/opensfm reconstruct /www/data/838b255e-3e99-49e2-beab-5ae8bb0653f1/opensfm
2018-11-09 13:37:44,646 INFO: Starting incremental reconstruction
Traceback (most recent call last):
File "/code/SuperBuild/src/opensfm/bin/opensfm", line 34, in <module>
command.run(args)
File "/code/SuperBuild/src/opensfm/opensfm/commands/reconstruct.py", line 21, in run
report = reconstruction.incremental_reconstruction(data)
File "/code/SuperBuild/src/opensfm/opensfm/reconstruction.py", line 1161, in incremental_reconstruction
pairs = compute_image_pairs(common_tracks, data)
File "/code/SuperBuild/src/opensfm/opensfm/reconstruction.py", line 416, in compute_image_pairs
result = parallel_map(_compute_pair_reconstructability, args, processes)
File "/code/SuperBuild/src/opensfm/opensfm/context.py", line 38, in parallel_map
return list(e.map(func, args))
File "/usr/local/lib/python2.7/dist-packages/loky/process_executor.py", line 1069, in map
timeout=timeout)
File "/usr/local/lib/python2.7/dist-packages/loky/_base.py", line 581, in map
fs = [self.submit(fn, *args) for args in zip(*iterables)]
File "/usr/local/lib/python2.7/dist-packages/loky/reusable_executor.py", line 151, in submit
fn, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/loky/process_executor.py", line 1016, in submit
raise self._flags.broken
loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
Traceback (most recent call last):
File "/code/run.py", line 47, in <module>
plasm.execute(niter=1)
File "/code/scripts/run_opensfm.py", line 133, in process
(context.pyopencv_path, context.opensfm_path, tree.opensfm))
File "/code/opendm/system.py", line 34, in run
raise Exception("Child returned {}".format(retcode))
Exception: Child returned 1


#6

From the output, it looks as though the max_concurrency parameter is still set to more than 1. If OpenSfM ran on a single process, you shouldn’t get this error:

In compute_image_pairs:
result = parallel_map(_compute_pair_reconstructability, args, processes)
File "/code/SuperBuild/src/opensfm/opensfm/context.py", line 38, in parallel_map
return list(e.map(func, args))

Did you create a new task, or did you restart the old one after setting the max_concurrency parameter?


#7

I restarted it. Now I’ll run the whole thing from scratch.


#8

@pierotofy Unfortunately I gave up here: when I tried running this with max_concurrency = 1 it hung well before it got to OpenSfM. I ran it three times and it hung each time (I killed it about 10-12 hours into the run). There was no memory problem, but of course CPU was constantly at 100%. The parameters I was using were the “High Resolution” preset with no resizing of images. I think there is something buggy with it, but at this point I really needed to get the project done, so I gave up. I’m happy to share the images on Google Drive or Dropbox if you’d like to check.
I ended up running it with the “Default” preset and resizing images to the default (2048). The process itself worked flawlessly on a much less powerful machine (16 vCPUs and 64GB RAM) and took 3 hours 43 minutes to complete. I got all the assets, but now have a pretty big problem with the results. I will open a new thread about that, since it’s not relevant to what was discussed here; it concerns the quality of the results and some pretty bad deformations of the orthophoto.


#9

A copy of the dataset would be most helpful for diagnosing this. Thank you.