Azure VM - 801 Aborted error

Hi all,
I recently gained access to an Azure VM with 2TB of RAM and 4TB of swap.
I started processing 8054 images.
The processing stopped after 4 days and 19 hours.
I placed all of the JPG files in /home/declan/datasets/CampNov22/images
and ran…
docker run -ti --rm -v /home/declan/datasets:/datasets opendronemap/odm --project-path /datasets CampNov22 --dsm --pc-quality ultra --feature-quality ultra --pc-skip-geometric

Possible issues:
I did not set --min-num-features 15000 or --matcher-neighbors 20
Some images have long names…
e.g. 50m60ol30ca_v1-b_DJI_0185.JPG
Image extensions are all caps (‘JPG’)
Some image names and the project folder name contain capital letters

Any ideas as to why this failed?
Many thanks,
Declan

This is the output from the logs…
{
“odmVersion”: “3.0.3”,
“memory”: {
“total”: 2064248,
“available”: 2044652
},
“cpus”: 128,
“images”: 8054,
“options”: {
“3d_tiles”: false,
“align”: null,
“auto_boundary”: false,
“auto_boundary_distance”: 0,
“bg_removal”: false,
“boundary”: {},
“build_overviews”: false,
“camera_lens”: “auto”,
“cameras”: {},
“cog”: false,
“copy_to”: null,
“crop”: 3,
“dem_decimation”: 1,
“dem_euclidean_map”: false,
“dem_gapfill_steps”: 3,
“dem_resolution”: 5,
“dsm”: true,
“dtm”: false,
“end_with”: “odm_postprocess”,
“fast_orthophoto”: false,
“feature_quality”: “ultra”,
“feature_type”: “sift”,
“force_gps”: false,
“gcp”: null,
“geo”: null,
“gps_accuracy”: 10,
“ignore_gsd”: false,
“matcher_neighbors”: 0,
“matcher_type”: “flann”,
“max_concurrency”: 128,
“merge”: “all”,
“mesh_octree_depth”: 11,
“mesh_size”: 200000,
“min_num_features”: 10000,
“name”: “CampNov22”,
“no_gpu”: false,
“optimize_disk_space”: false,
“orthophoto_compression”: “DEFLATE”,
“orthophoto_cutline”: false,
“orthophoto_kmz”: false,
“orthophoto_no_tiled”: false,
“orthophoto_png”: false,
“orthophoto_resolution”: 5,
“pc_classify”: false,
“pc_copc”: false,
“pc_csv”: false,
“pc_ept”: false,
“pc_filter”: 2.5,
“pc_las”: false,
“pc_quality”: “ultra”,
“pc_rectify”: false,
“pc_sample”: 0,
“pc_skip_geometric”: true,
“pc_tile”: false,
“primary_band”: “auto”,
“project_path”: “/datasets/CampNov22”,
“radiometric_calibration”: “none”,
“rerun”: null,
“rerun_all”: false,
“rerun_from”: null,
“rolling_shutter”: false,
“rolling_shutter_readout”: 0,
“sfm_algorithm”: “incremental”,
“skip_3dmodel”: false,
“skip_band_alignment”: false,
“skip_orthophoto”: false,
“skip_report”: false,
“sky_removal”: false,
“sm_cluster”: null,
“sm_no_align”: false,
“smrf_scalar”: 1.25,
“smrf_slope”: 0.15,
“smrf_threshold”: 0.5,
“smrf_window”: 18.0,
“split”: 999999,
“split_image_groups”: null,
“split_overlap”: 150,
“texturing_keep_unseen_faces”: false,
“texturing_single_material”: false,
“texturing_skip_global_seam_leveling”: false,
“texturing_skip_local_seam_leveling”: false,
“tiles”: false,
“use_3dmesh”: false,
“use_exif”: false,
“use_fixed_camera_params”: false,
“use_hybrid_bundle_adjustment”: false
},
“startTime”: “2023-01-17T13:02:34.104964”,
“stages”: [
{
“name”: “dataset”,
“startTime”: “2023-01-17T13:02:35.507187”,
“messages”: [
{
“message”: “Running dataset stage”,
“type”: “info”
},
{
“message”: “Loading dataset from: /datasets/CampNov22/images”,
“type”: “info”
},
{
“message”: “Loading 8057 images”,
“type”: “info”
},
{
“message”: “Cannot read /datasets/CampNov22/images/P150m60ol30ca_DJI_0370.JPG with PIL, fallback to cv2: cannot identify image file ‘/datasets/CampNov22/images/P150m60ol30ca_DJI_0370.JPG’”,
“type”: “warning”
},
{
“message”: “P150m60ol30ca_DJI_0370.JPG seems corrupted and will not be used”,
“type”: “warning”
},
{
“message”: “Cannot read /datasets/CampNov22/images/50m60ol30ca_v1-a_DJI_0051_1.JPG with PIL, fallback to cv2: cannot identify image file ‘/datasets/CampNov22/images/50m60ol30ca_v1-a_DJI_0051_1.JPG’”,
“type”: “warning”
},
{
“message”: “50m60ol30ca_v1-a_DJI_0051_1.JPG seems corrupted and will not be used”,
“type”: “warning”
},
{
“message”: “Cannot read /datasets/CampNov22/images/P140m60ol45ca_DJI_0201.JPG with PIL, fallback to cv2: cannot identify image file ‘/datasets/CampNov22/images/P140m60ol45ca_DJI_0201.JPG’”,
“type”: “warning”
},
{
“message”: “P140m60ol45ca_DJI_0201.JPG seems corrupted and will not be used”,
“type”: “warning”
},
{
“message”: “Wrote images database: /datasets/CampNov22/images.json”,
“type”: “info”
},
{
“message”: “Found 8054 usable images”,
“type”: “info”
},
{
“message”: “Parsing SRS header: WGS84 UTM 35N”,
“type”: “info”
},
{
“message”: “Finished dataset stage”,
“type”: “info”
}
]
},
{
“name”: “split”,
“startTime”: “2023-01-17T13:14:17.402761”,
“messages”: [
{
“message”: “Running split stage”,
“type”: “info”
},
{
“message”: “Normal dataset, will process all at once.”,
“type”: “info”
},
{
“message”: “Finished split stage”,
“type”: “info”
}
]
},
{
“name”: “merge”,
“startTime”: “2023-01-17T13:14:17.402919”,
“messages”: [
{
“message”: “Running merge stage”,
“type”: “info”
},
{
“message”: “Normal dataset, nothing to merge.”,
“type”: “info”
},
{
“message”: “Finished merge stage”,
“type”: “info”
}
]
},
{
“name”: “opensfm”,
“startTime”: “2023-01-17T13:14:17.403004”,
“messages”: [
{
“message”: “Running opensfm stage”,
“type”: “info”
},
{
“message”: “Maximum photo dimensions: 4056px”,
“type”: “info”
},
{
“message”: “Photo dimensions for feature extraction: 4056px”,
“type”: “info”
},
{
“message”: “nvidia-smi not found in PATH, using CPU”,
“type”: “info”
},
{
“message”: “Altitude data detected, enabling it for GPS alignment”,
“type”: “info”
},
{
“message”: [
“use_exif_size: no”,
“flann_algorithm: KDTREE”,
“feature_process_size: 4056”,
“feature_min_frames: 10000”,
“processes: 128”,
“matching_gps_neighbors: 0”,
“matching_gps_distance: 0”,
“matching_graph_rounds: 50”,
“optimize_camera_parameters: yes”,
“reconstruction_algorithm: incremental”,
“undistorted_image_format: tif”,
“bundle_outlier_filtering_type: AUTO”,
“sift_peak_threshold: 0.066”,
“align_orientation_prior: vertical”,
“triangulation_type: ROBUST”,
“retriangulation_ratio: 2”,
“matcher_type: FLANN”,
“feature_type: SIFT”,
“use_altitude_tag: yes”,
“align_method: auto”,
“local_bundle_radius: 0”
],
“type”: “info”
},
{
“message”: “Wrote reference_lla.json”,
“type”: “info”
},
{
“message”: “running "/code/SuperBuild/install/bin/opensfm/bin/opensfm" detect_features "/datasets/CampNov22/opensfm"”,
“type”: “info”
},
{
“message”: “running "/code/SuperBuild/install/bin/opensfm/bin/opensfm" match_features "/datasets/CampNov22/opensfm"”,
“type”: “info”
},
{
“message”: “running "/code/SuperBuild/install/bin/opensfm/bin/opensfm" create_tracks "/datasets/CampNov22/opensfm"”,
“type”: “info”
},
{
“message”: “running "/code/SuperBuild/install/bin/opensfm/bin/opensfm" reconstruct "/datasets/CampNov22/opensfm"”,
“type”: “info”
},
{
“message”: “Uh oh! Processing stopped because of strange values in the reconstruction. This is often a sign that the input data has some issues or the software cannot deal with it. Have you followed best practices for data acquisition? See Flying Tips — OpenDroneMap 3.0.3 documentation”,
“type”: “error”
}
],
“endTime”: “2023-01-22T07:33:49.985583”,
“totalTime”: 411572.58
}
],
“processes”: [
{
“command”: “"/code/SuperBuild/install/bin/opensfm/bin/opensfm" detect_features "/datasets/CampNov22/opensfm"”,
“exitCode”: 0,
“output”: [
“2023-01-17 13:46:47,298 DEBUG: Found 94887 points in 5.566389799118042s”,
“2023-01-17 13:46:47,299 DEBUG: done”,
“2023-01-17 13:46:47,402 DEBUG: Found 62727 points in 4.435869455337524s”,
“2023-01-17 13:46:47,402 DEBUG: done”,
“2023-01-17 13:46:47,510 DEBUG: Found 80460 points in 5.27877950668335s”,
“2023-01-17 13:46:47,511 DEBUG: done”,
“2023-01-17 13:46:47,588 DEBUG: Found 61766 points in 4.7386720180511475s”,
“2023-01-17 13:46:47,588 DEBUG: done”,
“2023-01-17 13:46:49,784 DEBUG: Found 241235 points in 6.456133842468262s”,
“2023-01-17 13:46:49,785 DEBUG: done”
]
},
{
“command”: “"/code/SuperBuild/install/bin/opensfm/bin/opensfm" match_features "/datasets/CampNov22/opensfm"”,
“exitCode”: 0,
“output”: [
“2023-01-17 18:44:20,619 DEBUG: Matching 50m60ol30ca_v1-b_DJI_0185.JPG and P135m60ol70ca_DJI_0281.JPG. Matcher: FLANN (symmetric) T-desc: 52.862 T-robust: 0.018 T-total: 52.880 Matches: 147 Robust: 12 Success: False”,
“2023-01-17 18:44:20,892 DEBUG: Matching P225m60ol70ca_DJI_0164.JPG and P230m60ol45ca_DJI_0710.JPG. Matcher: FLANN (symmetric) T-desc: 100.926 T-robust: 0.019 T-total: 100.946 Matches: 215 Robust: 24 Success: True”,
“2023-01-17 18:44:21,161 DEBUG: Matching 50m60ol30ca_v1-a_DJI_0045_1.JPG and P135m60ol70ca_DJI_0975.JPG. Matcher: FLANN (symmetric) T-desc: 51.578 T-robust: 0.017 T-total: 51.596 Matches: 124 Robust: 13 Success: False”,
“2023-01-17 18:44:21,521 DEBUG: Matching P340m60ol45ca_DJI_0776.JPG and P335m60ol70ca_DJI_0203.JPG. Matcher: FLANN (symmetric) T-desc: 3.161 T-robust: 0.002 T-total: 3.164 Matches: 493 Robust: 364 Success: True”,
“2023-01-17 18:44:21,589 DEBUG: Matching P140m60ol45ca_DJI_0554.JPG and 50m60ol30ca_v1-b_DJI_0691_1.JPG. Matcher: FLANN (symmetric) T-desc: 60.018 T-robust: 0.022 T-total: 60.041 Matches: 432 Robust: 218 Success: True”,
“2023-01-17 18:44:22,109 DEBUG: Matching 50m60ol30ca_v1-a_DJI_0200.JPG and 50m60ol30ca_v1-b_DJI_0105_2.JPG. Matcher: FLANN (symmetric) T-desc: 1.487 T-robust: 0.010 T-total: 1.497 Matches: 129 Robust: 71 Success: True”,
“2023-01-17 18:44:22,360 DEBUG: Matching P135m60ol70ca_DJI_0335.JPG and P135m60ol70ca_DJI_0644.JPG. Matcher: FLANN (symmetric) T-desc: 101.663 T-robust: 0.005 T-total: 101.669 Matches: 497 Robust: 327 Success: True”,
“2023-01-17 18:44:24,745 DEBUG: Matching P135m60ol70ca_DJI_0301.JPG and P140m60ol45ca_DJI_0393.JPG. Matcher: FLANN (symmetric) T-desc: 65.615 T-robust: 0.020 T-total: 65.637 Matches: 295 Robust: 104 Success: True”,
“2023-01-17 18:45:00,511 DEBUG: Matching P135m60ol70ca_DJI_0007.JPG and P135m60ol70ca_DJI_0939.JPG. Matcher: FLANN (symmetric) T-desc: 108.210 T-robust: 0.019 T-total: 108.230 Matches: 286 Robust: 52 Success: True”,
“2023-01-17 18:45:00,680 INFO: Matched 88503 pairs (brown-brown: 88503) in 17866.780273348006 seconds (0.20187767962496184 seconds/pair).”
]
},
{
“command”: “"/code/SuperBuild/install/bin/opensfm/bin/opensfm" create_tracks "/datasets/CampNov22/opensfm"”,
“exitCode”: 0,
“output”: [
“2023-01-17 19:06:12,183 INFO: reading features”,
“2023-01-17 19:39:50,274 DEBUG: Merging features onto tracks”,
“2023-01-17 19:53:57,935 DEBUG: Good tracks: 53994633”
]
},
{
“command”: “"/code/SuperBuild/install/bin/opensfm/bin/opensfm" reconstruct "/datasets/CampNov22/opensfm"”,
“exitCode”: 134,
“output”: [
“2023-01-22 06:57:55,070 INFO: Adding 50m60ol30ca_v1-b_DJI_0041_2.JPG to the reconstruction”,
“2023-01-22 06:58:40,115 INFO: -------------------------------------------------------”,
“2023-01-22 06:58:40,167 INFO: 50m60ol30ca_v1-b_DJI_0039_2.JPG resection inliers: 944 / 981”,
“2023-01-22 06:58:40,218 INFO: Adding 50m60ol30ca_v1-b_DJI_0039_2.JPG to the reconstruction”,
“2023-01-22 06:59:25,356 INFO: -------------------------------------------------------”,
“2023-01-22 06:59:25,404 INFO: 50m60ol30ca_v1-b_DJI_0037_2.JPG resection inliers: 1059 / 1164”,
“2023-01-22 06:59:25,461 INFO: Adding 50m60ol30ca_v1-b_DJI_0037_2.JPG to the reconstruction”,
“2023-01-22 06:59:26,626 INFO: Shots and/or GCPs are well-conditioned. Using naive 3D-3D alignment.”,
“block_sparse_matrix.cc:80 Check failed: num_nonzeros_ >= 0”,
“/code/SuperBuild/install/bin/opensfm/bin/opensfm: line 12: 801 Aborted (core dumped) "$PYTHON" "$DIR"/opensfm_main.py "$@"”
]
}
],
“success”: false,
“error”: {
“code”: 134,
“message”: “Child returned 134”
},
“stackTrace”: [
“Traceback (most recent call last):”,
“File "/code/stages/odm_app.py", line 81, in execute”,
“self.first_stage.run()”,
“File "/code/opendm/types.py", line 386, in run”,
“self.next_stage.run(outputs)”,
“File "/code/opendm/types.py", line 386, in run”,
“self.next_stage.run(outputs)”,
“File "/code/opendm/types.py", line 386, in run”,
“self.next_stage.run(outputs)”,
“File "/code/opendm/types.py", line 365, in run”,
“self.process(self.args, outputs)”,
“File "/code/stages/run_opensfm.py", line 38, in process”,
“octx.reconstruct(args.rolling_shutter, self.rerun())”,
“File "/code/opendm/osfm.py", line 55, in reconstruct”,
“self.run(‘reconstruct’)”,
“File "/code/opendm/osfm.py", line 34, in run”,
“system.run(‘"%s" %s "%s"’ %”,
“File "/code/opendm/system.py", line 110, in run”,
“raise SubprocessException("Child returned {}".format(retcode), retcode)”,
“opendm.system.SubprocessException: Child returned 134”,
“”
],
“endTime”: “2023-01-22T07:33:49.985583”,
“totalTime”: 412275.88
}
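
For what it’s worth, the exit code itself is a clue. Below is a minimal sketch, assuming the usual shell convention that a status above 128 means the process was killed by signal number (status - 128). The helper name is only illustrative.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Describe a shell-style exit status.

    Assumes the common convention that values above 128 mean
    'terminated by signal (code - 128)'.
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with status {code}"


# 134 is the exitCode the log reports for "opensfm reconstruct".
print(describe_exit_code(134))
```

On Linux this reports SIGABRT for 134, which is what abort() raises after a failed runtime check, and it matches the “Aborted (core dumped)” line in the log. The 801 in that line is the process ID printed by the shell, not an error code. A kill by the kernel’s OOM killer would normally show up as SIGKILL (137) instead.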

Surprisingly, I suspect you ran out of RAM (core dumped), which I’ve never managed to do on a smaller machine with a similar-sized dataset. Since this is at the OpenSfM stage, the solution for RAM is to turn down your --feature-quality, which given the quality of your data should be just fine.

It is probably fine to keep --pc-quality at ultra, but for a test you can turn this down to high. However, this will affect the quality of your outputs.

OK. I am rerunning it with --feature-quality and --pc-quality set to high… let’s see what happens.

Swappiness is set to 60 on this Linux machine. Should I change this or leave it as is?

I’m not sure. What kind / speed of disks are backing it?

Hopefully, by switching to high, we’ve reduced your memory pressure enough that you could tune this down to 10, which is my typical profile.

But I haven’t tested the implications. I just know it works for me so far.

Looks like there are 128 of these:
processor : 127
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Platinum 8280L CPU @ 2.70GHz
stepping : 7
microcode : 0xffffffff
cpu MHz : 2693.673
cache size : 39424 KB
physical id : 3
siblings : 32
core id : 15
cpu cores : 16
apicid : 127
initial apicid : 127
fpu : yes
fpu_exception : yes
cpuid level : 21
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_vnni arch_capabilities
bugs : spectre_v1 spectre_v2 spec_store_bypass swapgs taa mmio_stale_data retbleed
bogomips : 5392.75
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:

Ah, that might explain the OOM: ratio of processors/threads to available memory matters.

Like you, I have 16GB per core, 8GB per thread, but allocate at the VM level 24GB per core, 12GB per thread.

Running at high instead of ultra, you should be alright. But I agree you should turn down your swappiness to 10. I think you can safely do this while it runs, but the MS folks will know for sure.

Swappiness is a runtime parameter so it is made to be changed while things are live. That being said… I’ve never done it while under memory pressure :grimacing:
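
A minimal sketch for checking the current value before touching it, assuming the standard Linux procfs location /proc/sys/vm/swappiness. The function names are illustrative, not part of any tool mentioned here.

```python
from pathlib import Path

# Standard procfs location of the setting on Linux.
SWAPPINESS_PATH = Path("/proc/sys/vm/swappiness")


def parse_swappiness(text: str) -> int:
    """Parse the contents of the swappiness file into an integer."""
    value = int(text.strip())
    if value < 0:
        raise ValueError(f"unexpected swappiness value: {value}")
    return value


def read_swappiness(path: Path = SWAPPINESS_PATH) -> int:
    """Read the current vm.swappiness value from procfs."""
    return parse_swappiness(path.read_text())


if __name__ == "__main__":
    if SWAPPINESS_PATH.exists():
        print("vm.swappiness =", read_swappiness())
    else:
        print("No swappiness file found (not a Linux host?)")
```

Changing it live is normally done as root with `sysctl -w vm.swappiness=10`. That setting does not survive a reboot unless it is also added to the sysctl configuration.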

:joy: yes. I would modify it on the fly, on the assumption that it wouldn’t apply to anything already written, only to future choices about memory use. But I regularly break long-running processes with bold but wrong assumptions.

Hi Stephen,
I am not sure how to allocate 24GB per core, 12GB per thread at the VM level.
Can you give more info on this or point me in the right direction?
The process is currently ‘adding images to the reconstruction’.
It doesn’t seem to be using the swap at all.
88GB is free out of 2TB, but most of the memory is in ‘buff/cache’.
Thanks,
Declan

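With swappiness set to 10, the kernel strongly prefers reclaiming page cache over swapping out process memory, so you shouldn’t see anything in swap for a while yet. The common rule of thumb that it waits until roughly 90% of physical memory is used is only an approximation: swappiness is a relative weighting, not a hard threshold.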

The easiest way (I’ve got it done at the hypervisor level, which you won’t have access to where you’re running) is to limit your --max-concurrency to 64.
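
As a rough back-of-the-envelope check, here is a minimal sketch of what halving the concurrency does to the memory available per worker. It takes “memory.total” (2064248) and “cpus” (128) from the log above, assumes the memory figure is in MB, and assumes memory is shared evenly between workers, which real runs will not do exactly.

```python
TOTAL_MEMORY_MB = 2_064_248  # "memory.total" from the log; assumed to be in MB
LOGICAL_CPUS = 128           # "cpus" from the log


def memory_per_worker_gib(total_mb: float, workers: int) -> float:
    """Even share of memory per concurrent worker, in GiB."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return total_mb / workers / 1024


for workers in (LOGICAL_CPUS, LOGICAL_CPUS // 2):
    share = memory_per_worker_gib(TOTAL_MEMORY_MB, workers)
    print(f"max-concurrency {workers}: ~{share:.1f} GiB per worker")
```

Halving the number of workers doubles the even share, which is the effect being aimed for here without touching the VM configuration.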

And the logs show that the swap wasn’t used. Bit weird!
This is the data from sysstat for the day the processing stopped…

This isn’t a memory issue. It looks like a Ceres Solver issue, which needs a decently informed developer to switch Ceres Solver from 32-bit to 64-bit integers. In short: it is an overflow, but in a specific part of the toolchain, not a memory overflow:
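
To illustrate the kind of overflow meant here, a minimal sketch of how a count stored in a signed 32-bit integer can go negative and trip a check like `num_nonzeros_ >= 0` from the log. The block and entry counts are hypothetical, chosen only to cross the 32-bit limit; they are not taken from this run, and the wraparound shown is the typical two’s-complement behaviour rather than anything specific to Ceres.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def wrap_int32(value: int) -> int:
    """Reinterpret an integer as a signed 32-bit value (two's-complement wraparound)."""
    return (value + 2**31) % 2**32 - 2**31


# Hypothetical sizes, for illustration only (not measured from this dataset).
hypothetical_blocks = 300_000_000
entries_per_block = 9

true_count = hypothetical_blocks * entries_per_block
stored_count = wrap_int32(true_count)

print("true count exceeds int32 max:", true_count > INT32_MAX)
print("value as stored in 32 bits:  ", stored_count)
print("passes a '>= 0' check:       ", stored_count >= 0)
```

With 64-bit counters the same count fits comfortably, which is why a switch to 64-bit integers is the suggested fix rather than adding more RAM.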

Thanks for this clarification Stephen. This makes more sense.
It would appear that there is no obvious solution to this other than running a split-merge and getting a modeller to join the resulting submodels.

I think so. If you run with local split-merge, all the meshes will be available on disk. If you open an issue on the NodeODM repo, I can also look into how best to modify the API to allow for distributed processing, which would be faster.
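
A minimal sketch of what a local split-merge rerun could look like, built on the original docker command from the first post. The --split value of 1000 images per submodel is purely an assumption for illustration and would need tuning; --split-overlap 150 is the default shown in the log, and the high quality settings and --max-concurrency 64 are the values discussed earlier in this thread. The function name is hypothetical.

```python
import shlex


def build_split_merge_command(datasets_dir: str, project: str, split: int,
                              split_overlap: int, max_concurrency: int) -> list:
    """Build the docker argument list for a local split-merge run of ODM."""
    return [
        "docker", "run", "-ti", "--rm",
        "-v", f"{datasets_dir}:/datasets",
        "opendronemap/odm",
        "--project-path", "/datasets", project,
        "--dsm",
        "--feature-quality", "high",
        "--pc-quality", "high",
        "--pc-skip-geometric",
        "--split", str(split),                  # average images per submodel
        "--split-overlap", str(split_overlap),  # overlap between submodels, in metres
        "--max-concurrency", str(max_concurrency),
    ]


cmd = build_split_merge_command(
    "/home/declan/datasets", "CampNov22",
    split=1000,          # assumed value, not a recommendation
    split_overlap=150,
    max_concurrency=64,
)
print(shlex.join(cmd))
```

Printing the command rather than running it keeps the sketch safe to try; the output can be pasted into a shell once the split size has been chosen.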
