Split large data set - guidelines and recommendations


I was looking around the forum to fish for recommendations on a starting point for setting --split and --split-overlap. I came across a few specific threads, but thought I'd start a new topic to find a good rule of thumb to start from. For instance, it would be great to have a recommendation for a 5000-image data set such as: --split = # images / 10 and --split-overlap = swath / 2, or anything else that people have had good experience with. What about 10000 images? Larger data sets?


Tnx !

Choose the largest number a node can safely process; fewer splits = better results.


Does memory consumption scale linearly with image count, megapixels per image, or total pixel count (images x MP)?

Would one be able to make a safe guesstimate based upon available RAM, image MP, and image count to say, for instance: on 16GB you can safely process 3.2GP, so 200x16MP images per split?
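Just to make the guesstimate concrete, here is a minimal sketch of that hypothetical linear model. The 0.2 GP-per-GB ratio is only the example figure from this thread (16GB => 3.2GP), not a measured constant, and the function name is made up for illustration:

```python
# Hypothetical linear scaling: total pixel budget grows with RAM.
# gp_per_gb = 0.2 reproduces the example above (16 GB -> 3.2 GP);
# it is an assumption, not a profiled value.
def max_images_per_split(ram_gb, image_mp, gp_per_gb=0.2):
    """Guesstimate how many images of a given MP size fit in one split."""
    budget_gp = ram_gb * gp_per_gb           # pixel budget in gigapixels
    return int(budget_gp * 1000 / image_mp)  # 1 GP = 1000 MP

print(max_images_per_split(16, 16))  # -> 200 with the example ratio
```

Real profiling would be needed to know whether the relationship is actually linear, and which step dominates.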

Thanks Saijin.
That’s exactly what I’m trying to put my finger on… a valid guesstimate based on HW params.
How did you arrive at 16GB => 3.2Gpix? Experience or some inherent functionality? What’s the transformation?


@gast, that is merely an example number I pulled from nothing. I was just musing and wanted to use simple/round numbers for my example.

I’d need to do some profiling to get a better sense of what the relationship might be.

A good place to start is with the ClusterODM example slugs:


    "imageSizeMapping": [
        {"maxImages": 40, "slug": "s-2vcpu-2gb"},
        {"maxImages": 250, "slug": "s-4vcpu-8gb"},
        {"maxImages": 500, "slug": "s-6vcpu-16gb"},
        {"maxImages": 1500, "slug": "s-8vcpu-32gb"},
        {"maxImages": 2500, "slug": "s-16vcpu-64gb"},
        {"maxImages": 3500, "slug": "s-20vcpu-96gb"},
        {"maxImages": 5000, "slug": "s-24vcpu-128gb"}

One thing I should add to this: this assumes a swapfile of 1.5-2x RAM, so for 128GB, one would have a 192-256GB swapfile.


And that maxImages: does it assume a certain MP resolution per image, or is it tied solely to image count, so that a 1.2MP image counts the same as a 12MP image towards the RAM limit?

Good question - does ODM load all images into memory in their entirety when processing begins, or is there some other managed flow for loading images into memory?


It’s pretty efficient for most steps. OpenSfM (if memory serves me) does a bit of streaming and writes out a lot to disk, but not in an IO-limiting way. I will be doing some profiling over the next week for a project, so I will update this thread as I know more, but I think the mvs-texturing step is the current memory bottleneck.


Hmm, I don’t know. There may be some assumptions that are made based on the ways that Piero has shaped available settings in the Lightning network. That said, using these settings on a previous project with auto-scaling, I don’t think I hit any memory issues, and I was using some pretty aggressive product-resolution and depthmap-resolution settings.


I recently tried to run about 2600 images, each around 5MB, with
--split 500
--split-overlap 50
on a machine with 50GB of memory, which ended up using 100GB of disk for writing (…and crashed since I didn’t have more disk) ;). So yes - OpenSfM does use up disk.
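As a rough sanity check on a run like this: --split caps the images per submodel, so before accounting for --split-overlap (a geometric radius that duplicates images near submodel borders, so real counts will be higher) the submodel count is roughly a ceiling division. A sketch, with the function name made up:

```python
import math

# Rough estimate only: --split-overlap adds duplicated border images on
# top of this, so treat the result as a lower bound on submodel size/count.
def approx_submodels(num_images, split):
    return math.ceil(num_images / split)

print(approx_submodels(2600, 500))  # -> 6 submodels, give or take overlap
```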


Haha! The (generous) estimate I use for disk usage is 10x during processing. So for 50GB, allocate 500GB. :man_facepalming:


Quick clarification @smathermather-cm: in the example above, is 50GB the size of the data set or the size of RAM?
Would the allocation change for a --split/--split-overlap job?

Tnx !!

I meant on-disk storage, which can be mitigated by setting the --optimize-disk-space flag. Then the storage you need is ~2x by the end; I don’t know what the intermediate storage needs are.
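Pulling the two rules of thumb from this thread together in one place: ~10x the dataset size on disk during processing, or ~2x by the end with --optimize-disk-space (intermediate needs with that flag are unknown). A sketch, not an ODM API:

```python
# Back-of-envelope disk budget from the thread's rules of thumb.
# With --optimize-disk-space the ~2x figure is the *end* state;
# peak intermediate usage is not characterized here.
def disk_estimate_gb(dataset_gb, optimize_disk_space=False):
    factor = 2 if optimize_disk_space else 10
    return dataset_gb * factor

print(disk_estimate_gb(50))        # -> 500
print(disk_estimate_gb(50, True))  # -> 100
```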

Regarding resource needs, especially RAM, for split-merge: I haven’t done any serious profiling since some pretty substantive updates a few months back, but I expect to in the next few days to weeks.


I am familiar with --optimize-disk-space; however, I was referring to the “So 50GB allocate 500GB” comment… is it A or B below:

Option A:
I have a 50GB data set so I need at least 500GB of disk storage

Option B:
I have a 50GB RAM on my machine, which should go hand-in-hand with at least 500GB of disk storage

again, thank you.

Gotcha. Option A.

To answer the other question, if you have 50GB of RAM, aim for at least 50GB of swap, or even 75 or 100 GB of swap.
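Combined with the earlier 1.5-2x suggestion for the larger nodes, the swap guidance in this thread boils down to a range. A tiny sketch with made-up function names, just to fix the numbers:

```python
# Swap sizing per the thread: at least 1x RAM, up to 1.5-2x on big nodes.
# The multipliers are suggestions from this discussion, not hard rules.
def swap_range_gb(ram_gb, low=1.0, high=2.0):
    return (int(ram_gb * low), int(ram_gb * high))

print(swap_range_gb(50))             # -> (50, 100)
print(swap_range_gb(128, 1.5, 2.0))  # -> (192, 256), the 128GB node above
```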




Just for information, below are runs that have not finished yet. The data is 2788 images, each roughly 6MB.

Using --split 500

Using --split 1000

Using --split 1500

Also --split-overlap 100; all other params are the defaults.