Merging Large Orthophotos Hangs

I’m having some strange issues with the ODM split-merge workflow. I am running the ODM Docker image (via Singularity) on a machine with 32 CPUs and 48 GB of RAM. My dataset is relatively small (850 images), but they were taken from a Phantom 4 drone flying at 10 m, so they have a very low GSD. I am trying to build a very high resolution orthophoto, and I’ve found that splitting into three clusters is a good way to do this without running out of RAM.

However, the behavior I’m seeing is that the split step works just fine, but then it hangs during the orthophoto merge step. Looking at top, I just see the Python process sitting at ~100% CPU usage. The RAM usage increases for the first 10 minutes or so and then levels off at around 8 GB. After that, nothing visible happens: there is no output in the terminal. It does create a merged orthophoto file on disk, but this grows excruciatingly slowly (a few hundred KB every 10 minutes or so).

I haven’t been able to find many similar issues, other than issue #988 on GitHub (sorry, it won’t let me add a link), which appears to have been solved a long time ago. Based on the behavior and output I’m seeing, I think I’ve traced where it’s getting stuck to this section of the code. I’m going to try running this manually on my dev machine to see if I can a) reproduce the issue, and b) come up with a firmer diagnosis.

This is the command I’m running:

singularity run docker://opendronemap/odm \
  --project-path /lscratch/odm_jeevan_field \
  --max-concurrency 32 \
  --orthophoto-resolution 0.27 \
  --fast-orthophoto \
  --skip-3dmodel \
  --matcher-distance 30 \
  --split 400 \
  --split-overlap 10 \
  --rerun-all \
  --texturing-skip-local-seam-leveling \
  --merge orthophoto \
  --orthophoto-compression JPEG \
  2021_08_09_p4_ortho

Unfortunately, I cannot share all of my images for confidentiality reasons, but I have provided a few example images in a shared folder. I’ve also included the ODM output in that same folder.

Has anyone else had this problem before? I’d appreciate any insights you can provide.


Welcome!

Thanks for digging into this.

I’m out of my depth here, but perhaps someone more knowledgeable will come by shortly to add to the discussion!

Well, I’m thoroughly confused…

I’ve traced the performance issue down to Dataset.read() in rasterio (e.g. this line). It works perfectly for exactly the first 1583 blocks, and then on the 1584th, read() on one of the sources suddenly starts taking >25 seconds. And yes, I’ve checked: it really is related to the number of calls, not the blocks themselves. If you have it skip the first few blocks, it will start to slow down exactly 1584 blocks later.
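For reference, here is a minimal sketch of how this kind of per-block timing can be reproduced with rasterio (filenames are placeholders, and the windowing is simplified compared to ODM’s actual merge code). It times read() on one orthophoto using block windows taken from another raster’s grid, which is roughly the merge loop’s access pattern:

import time
import rasterio
from rasterio.windows import bounds, from_bounds

GRID = "submodel_0000_orthophoto_feathered.tif"   # raster supplying the block grid (placeholder)
SRC = "submodel_0001_orthophoto_feathered.tif"    # raster being read (placeholder)

with rasterio.open(GRID) as grid, rasterio.open(SRC) as src:
    for i, (_, grid_window) in enumerate(grid.block_windows(1)):
        # Map the grid block into the source via georeferenced bounds,
        # roughly mirroring what the merge loop does.
        win_bounds = bounds(grid_window, grid.transform)
        src_window = from_bounds(*win_bounds, transform=src.transform)

        start = time.time()
        src.read(window=src_window)          # read all bands for this window
        elapsed = time.time() - start
        if elapsed > 1.0:
            print(f"window {i}: read took {elapsed:.1f} s")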

This might well be an issue with rasterio and not ODM per se.


That’s amazing profiling!

Any idea what could possibly cause that within rasterio?

As of yet, no clue. The fact that it only manifests after a certain number of calls suggests something odd is going on with caching, or some similar stateful mechanism.

I’ve also noticed several other things:

  • The problem seems to manifest only when loading GeoTIFFs for the first submodel. I’m not sure why. It affects both the feathered and cropped TIFFs.
  • The GeoTIFFs themselves look valid. I am able to open them (albeit slowly) in QGIS.

If it would help with debugging at all, I can upload the specific GeoTIFFs being generated for each split to Google Drive. I am not an expert on this format, so there could very well be something that I’m missing.


Yeah, something weird is definitely happening. I straced the process, and I see that while it’s hung, it’s making hundreds of read() calls every second. Isn’t the whole point of block-based reading that you don’t need to read the whole file into memory? Because it sure is reading a lot of data for a single block…

Incidentally, I notice that the block windows come from the destination, not the sources. This is conjecture, but is it possible that blocks in the destination don’t line up with blocks in the sources? Could we be trying to do a read that happens to be very inefficient?
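One way to check that conjecture would be something along these lines (again only a sketch, with placeholder filenames): compare each source’s internal block layout with the destination’s block grid, and see where a destination block lands in each source.

import rasterio
from rasterio.windows import bounds, from_bounds

DEST = "odm_orthophoto_merged.tif"                  # placeholder
SOURCES = ["submodel_0000_orthophoto_cropped.tif",  # placeholder
           "submodel_0001_orthophoto_cropped.tif"]

with rasterio.open(DEST) as dst:
    print("dest block shape:", dst.block_shapes[0])
    # Take the first destination block and see where it lands in each source.
    _, dst_window = next(dst.block_windows(1))
    dst_bounds = bounds(dst_window, dst.transform)

    for path in SOURCES:
        with rasterio.open(path) as src:
            src_window = from_bounds(*dst_bounds, transform=src.transform)
            print(path,
                  "block shape:", src.block_shapes[0],
                  "is_tiled:", src.profile.get("tiled", False),
                  "window offsets:", (src_window.row_off, src_window.col_off))
            # If the offsets are not multiples of the source block size, every
            # destination block straddles several source blocks, which could
            # force GDAL to decode (and cache) far more data per block.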


Well, I’ve figured out something new: this problem has something to do with GDAL’s block caching mechanism. Specifically, increasing the cache size manually causes it to take longer to hang (with a corresponding spike in memory usage). Similarly, by reducing the cache size, you can get it to hang quickly. I still don’t know why this would happen, though, unless GDAL’s eviction algorithm is agonizingly slow. Since it’s accessing blocks one-by-one in a regular pattern, it seems like it should have no trouble purging old blocks. In fact, it seems like it should be accessing every block exactly once, such that caching should provide no benefit at all.
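For anyone who wants to repeat the experiment, the cache size can be controlled through GDAL’s GDAL_CACHEMAX config option. A minimal example (the path and values are placeholders):

import rasterio

# Values this small are interpreted by GDAL as megabytes.
with rasterio.Env(GDAL_CACHEMAX=2048):   # ~2 GB of block cache
    with rasterio.open("submodel_0000_orthophoto_feathered.tif") as src:
        _, window = next(src.block_windows(1))
        block = src.read(window=window)

# Equivalently, export GDAL_CACHEMAX in the shell before launching the merge;
# shrinking it (e.g. to 64) makes the hang show up much sooner.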


Do you know for sure if this is the case? If so, I know Even Rouault would absolutely love having some assistance keeping GDAL fixed up and would welcome a PR, or even just a really detailed/technical issue report.

I’m working on gathering more info on this.

Right now, I see the main obstacle as a lack of reproducibility. If others were chiming in saying that they had this problem, then I think we would be in a better place. As of now, it appears to be just me, and I don’t think I’m doing anything that strange, which makes me suspicious.


Here’s something you might find interesting:

The same problem manifests with rio-merge-rgba, which the ODM code is apparently based on. In fact, just to confirm that I’m not crazy, I’ve uploaded three of the orthophotos that it’s trying to merge to the Google Drive folder. Try merging them with rio-merge-rgba, or even merging any two of them, and you should see it hang.
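For reproduction, a stripped-down version of the merge pattern (loosely following what rio-merge-rgba and ODM’s merge step do, with feathering and nodata handling omitted, and placeholder filenames) should be enough to exercise the same read pattern on two of the uploaded orthophotos:

import numpy as np
import rasterio
from rasterio.transform import from_origin
from rasterio.windows import bounds, from_bounds

SOURCES = ["submodel_0000_orthophoto.tif", "submodel_0001_orthophoto.tif"]  # placeholders
OUTPUT = "merged_test.tif"

srcs = [rasterio.open(p) for p in SOURCES]

# Output grid covering the union of the inputs, at the first input's resolution.
left = min(s.bounds.left for s in srcs)
bottom = min(s.bounds.bottom for s in srcs)
right = max(s.bounds.right for s in srcs)
top = max(s.bounds.top for s in srcs)
xres, yres = srcs[0].res

profile = {
    "driver": "GTiff",
    "dtype": srcs[0].dtypes[0],
    "count": srcs[0].count,
    "crs": srcs[0].crs,
    "nodata": 0,
    "width": int(round((right - left) / xres)),
    "height": int(round((top - bottom) / yres)),
    "transform": from_origin(left, top, xres, yres),
    "tiled": True,
    "blockxsize": 512,
    "blockysize": 512,
}

with rasterio.open(OUTPUT, "w", **profile) as dst:
    # Iterate over the *destination* block grid, as the merge code does.
    for _, dst_window in dst.block_windows(1):
        out = np.zeros((profile["count"], int(dst_window.height), int(dst_window.width)),
                       dtype=profile["dtype"])
        win_bounds = bounds(dst_window, dst.transform)
        for src in srcs:
            src_window = from_bounds(*win_bounds, transform=src.transform)
            data = src.read(window=src_window, out_shape=out.shape,
                            boundless=True, fill_value=0)
            np.copyto(out, data, where=(out == 0))  # keep the first non-zero value per cell
        dst.write(out, window=dst_window)

for s in srcs:
    s.close()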


Does rio-merge-rgba link against GDAL as well?

Yeah, it uses rasterio, which relies on GDAL internally.


After merging the failed orthophotos manually in QGIS, I noticed that they were not georeferenced correctly, so the merged result was misaligned. This was because I didn’t have enough GCPs in each split. I re-ran without GCPs (--force-gps), and interestingly, the merge had no problem this time, even though my GPS is not very accurate and the orthophotos were not perfectly aligned. So I think this issue is an edge case that manifests with very “weird” orthophotos. If someone else who is encountering this happens upon this thread, I suggest they take a look at the orthophotos from the submodels and make sure they look correct and are properly georeferenced (a quick check is sketched below).
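Something like this is enough for a quick sanity check of the submodel orthophotos (the glob pattern is a guess at the project layout, so adjust it):

import glob
import rasterio

# Print each submodel orthophoto's CRS, bounds, and resolution so
# misaligned or badly georeferenced outputs stand out before the merge.
for path in sorted(glob.glob("submodels/*/odm_orthophoto/*.tif")):
    with rasterio.open(path) as src:
        print(path)
        print("  CRS:       ", src.crs)
        print("  bounds:    ", src.bounds)
        print("  resolution:", src.res)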

I’d still like to understand why this happens, but it is no longer as urgent.

