Split large data set - guidelines and recommendations

gast · June 22, 2020, 8:41pm

Hi,

I was looking around the forum for recommendations on a starting point for setting --split and --split-overlap. I came across a few specific threads but thought I'd start a new topic on finding a good rule of thumb to start from. For instance, it would be great to have a recommendation for a 5000-image data set, such as --split = # images / 10 and --split-overlap = swath / 2, or anything else that people have had good experience with. What about 10000 images? Larger data sets?

Suggestions?

Tnx !

pierotofy · June 22, 2020, 9:07pm

Choose the largest number a node can safely process; fewer splits = better results.

Saijin_Naib · June 22, 2020, 9:12pm

Does memory consumption scale linearly with image count, megapixels per image, or total pixel count (images x MP)?

Would one be able to make a safe guesstimate based on available RAM, image MP, and image count? For instance: on 16GB you can safely process 3.2GP, so 200 x 16MP images per split.

gast · June 23, 2020, 5:17am

Thanks Saijin.
That’s exactly what I’m trying to put my finger on: a valid guesstimate based on HW params.
How did you arrive at 16GB => 3.2Gpix? Experience, or some inherent functionality? What’s the transformation?

Saijin_Naib · June 23, 2020, 5:32am

@gast, that is merely an example number I pulled from nothing. I was just musing and wanted to use simple/round numbers for my example.

I’d need to do some profiling to get a better sense of what the relationship might be.

smathermather · June 23, 2020, 12:41pm

A good place to start is with the ClusterODM example slugs:

github.com

OpenDroneMap/ClusterODM/blob/master/docs/digitalocean.md

# Provider Configuration for DigitalOcean

Example configuration file:

```json
{
    "provider": "digitalocean",
    "accessToken": "CHANGEME",
    "s3": {
        "accessKey": "CHANGEME",
        "secretKey": "CHANGEME",
        "endpoint" :"sfo2.digitaloceanspaces.com",
        "bucket": "CHANGEME",
        "ignoreSSL": false
    },

    "createRetries": 10,
    "maxRuntime": -1,
    "maxUploadTime": -1,
    "dropletsLimit": 30,
```

(The file excerpt is truncated here.)

Specifically:

    "imageSizeMapping": [
        {"maxImages": 40, "slug": "s-2vcpu-2gb"},
        {"maxImages": 250, "slug": "s-4vcpu-8gb"},
        {"maxImages": 500, "slug": "s-6vcpu-16gb"},
        {"maxImages": 1500, "slug": "s-8vcpu-32gb"},
        {"maxImages": 2500, "slug": "s-16vcpu-64gb"},
        {"maxImages": 3500, "slug": "s-20vcpu-96gb"},
        {"maxImages": 5000, "slug": "s-24vcpu-128gb"}
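
One way to read this table as a starting point for --split is sketched below in Python. The selection rule (smallest tier whose maxImages covers the image count) and both helper names, `pick_slug` and `max_images_for_ram`, are assumptions made for illustration; this is not ClusterODM's own code, and the numbers are only the example values quoted above.

```python
import re

# Copied from the imageSizeMapping excerpt quoted above.
IMAGE_SIZE_MAPPING = [
    {"maxImages": 40, "slug": "s-2vcpu-2gb"},
    {"maxImages": 250, "slug": "s-4vcpu-8gb"},
    {"maxImages": 500, "slug": "s-6vcpu-16gb"},
    {"maxImages": 1500, "slug": "s-8vcpu-32gb"},
    {"maxImages": 2500, "slug": "s-16vcpu-64gb"},
    {"maxImages": 3500, "slug": "s-20vcpu-96gb"},
    {"maxImages": 5000, "slug": "s-24vcpu-128gb"},
]


def slug_ram_gb(slug):
    """Extract the RAM size in GB from a slug such as 's-8vcpu-32gb'."""
    match = re.search(r"-(\d+)gb$", slug)
    if match is None:
        raise ValueError(f"cannot read RAM size from slug: {slug!r}")
    return int(match.group(1))


def pick_slug(image_count, mapping=IMAGE_SIZE_MAPPING):
    """Smallest tier whose maxImages covers image_count (assumed rule).

    Returns None when the count exceeds every tier, which suggests
    the job should be split.
    """
    for entry in sorted(mapping, key=lambda e: e["maxImages"]):
        if image_count <= entry["maxImages"]:
            return entry["slug"]
    return None


def max_images_for_ram(ram_gb, mapping=IMAGE_SIZE_MAPPING):
    """Largest maxImages among tiers whose RAM fits within ram_gb.

    Returns None when even the smallest tier needs more RAM.
    """
    fitting = [e["maxImages"] for e in mapping
               if slug_ram_gb(e["slug"]) <= ram_gb]
    return max(fitting) if fitting else None


if __name__ == "__main__":
    print(pick_slug(2600))
    print(max_images_for_ram(50))
```

Read this way, a 50GB machine falls into the 32GB tier, which would suggest a --split of no more than about 1500 as a first guess. Treat that as a rough starting point to verify on your own data.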

smathermather · June 23, 2020, 1:49pm

One thing I should add to this: this assumes a 1.5 or 2x inclusion of swapfile, so for 128GB, one would have a 192-256GB swapfile.

Saijin_Naib · June 23, 2020, 2:05pm

And that maxImages: does it assume a certain MP resolution per image, or is it tied solely to image count, so that a 1.2MP image counts the same as a 12MP image towards the RAM limit?

gast · June 23, 2020, 2:08pm

Good question. Does ODM load all images in their entirety into memory when processing begins, or is there some other managed flow for loading images into memory?

smathermather · June 23, 2020, 2:24pm

It’s pretty efficient for most steps. OpenSfM (if memory serves me) does a bit of streaming, I think, and writes out a lot to disk, but not in an IO-limiting way. I will be doing some profiling over the next week for a project, so I will update this thread as I know more, but I think the mvs-texturing step is the current memory bottleneck.

smathermather · June 23, 2020, 2:27pm

Hmm, I don’t know. There may be some assumptions that are made based on the ways that Piero has shaped available settings in the lightning network. That said, using these settings on a previous project with auto-scaling, I don’t think I hit any memory issues, and I was using some pretty aggressive product resolution and depthmap resolution settings.

gast · June 24, 2020, 11:20am

I recently tried to run about 2600 images, each around 5MB with
--split 500
--split-overlap 50
on a 50GB memory machine, which ended up using 100GB of disk for writing (…and crashed since I didn’t have more disk) ;). So yes, OpenSfM seems to use up disk.

smathermather · June 24, 2020, 1:55pm

Haha! The (generous) estimate I use for disk usage is 10x during processing. So 50GB allocate 500GB.

gast · July 11, 2020, 8:16pm

Quick clarification @smathermather: in the example above, is 50GB the size of the data set or the size of RAM?
Would the allocation change for a --split/--split-overlap job?

Tnx !!

smathermather · July 12, 2020, 12:43am

I meant on-disk storage, which can be mitigated by setting the --optimize-disk-space flag. Then the storage you need is ~2x by the end. I don’t know what the intermediate storage needs are.

Regarding resource needs (especially RAM) for split-merge, I haven’t done any serious profiling since some pretty substantive updates a few months back, but expect to in the next few days to weeks.

gast · July 12, 2020, 6:25am

I am familiar with --optimize-disk-space; however, I was referring to the “So 50GB allocate 500GB” comment. Is it A or B below?

Option A:
I have a 50GB data set so I need at least 500GB of disk storage

Option B:
I have 50GB of RAM on my machine, which should go hand-in-hand with at least 500GB of disk storage

again, thank you.

smathermather · July 12, 2020, 5:03pm

Gotcha. Option A.
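
As a worked example of the 10x estimate (Option A), here is a minimal Python sketch. The function names are hypothetical, the 10x factor is the generous estimate quoted in this thread rather than a measured value, and 1 GB is taken as 1000 MB for simplicity.

```python
def dataset_size_gb(image_count, mb_per_image):
    """Approximate data set size in GB (1 GB taken as 1000 MB)."""
    return image_count * mb_per_image / 1000.0


def disk_needed_gb(dataset_gb, factor=10.0):
    """Disk to allocate for processing: data set size times a factor.

    factor=10 is the generous estimate from this thread. The ~2x quoted
    for --optimize-disk-space describes the end state only; intermediate
    needs were not known.
    """
    return dataset_gb * factor


if __name__ == "__main__":
    size = dataset_size_gb(2600, 5)   # the 2600-image run described earlier
    print(size, disk_needed_gb(size))
```

For the 2600-image, 5MB-per-image run mentioned earlier, this gives a data set of about 13GB and an allocation of about 130GB, which is in line with that run exhausting 100GB of disk.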

smathermather · July 12, 2020, 5:06pm

To answer the other question, if you have 50GB of RAM, aim for at least 50GB of swap, or even 75 or 100 GB of swap.
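
The swap guidance in this thread can be written down as a small helper. This is only a sketch: `swap_range_gb` is a hypothetical name, and the multipliers are the ones mentioned in the posts above (at least 1x RAM here, 1.5x to 2x for the ClusterODM slugs).

```python
def swap_range_gb(ram_gb, low=1.0, high=2.0):
    """Return (minimum, generous) swap sizes in GB for a given RAM size."""
    if ram_gb <= 0:
        raise ValueError("ram_gb must be positive")
    return (ram_gb * low, ram_gb * high)


if __name__ == "__main__":
    print(swap_range_gb(50))             # at least 1x, up to 2x
    print(swap_range_gb(128, low=1.5))   # the 1.5x to 2x case for 128GB
```

With these multipliers, 50GB of RAM gives a range of 50 to 100GB of swap, and 128GB at 1.5x to 2x gives 192 to 256GB, matching the figures quoted earlier.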

gast · July 12, 2020, 8:41pm

Kudos

gast · July 13, 2020, 4:01am

Just for information, see below for runs that have not finished yet. The data set is 2788 images, each roughly 6MB.

Using --split 500

Using --split 1000

Using --split 1500

Also --split-overlap 100 and all other params are the defaults.
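
For a rough sense of how many submodels each of these settings produces, here is a back-of-the-envelope Python sketch. It assumes --split is the average number of images per submodel, as I understand the ODM documentation, and the function name is hypothetical. Actual clustering is spatial and --split-overlap adds images to each submodel, so real counts and sizes will differ.

```python
import math


def approx_submodels(image_count, split):
    """Rough number of submodels for a given --split value.

    Assumes --split is the average number of images per submodel and
    ignores the extra images pulled in by --split-overlap.
    """
    if split <= 0:
        raise ValueError("split must be positive")
    return max(1, math.ceil(image_count / split))


if __name__ == "__main__":
    for split in (500, 1000, 1500):
        print(split, approx_submodels(2788, split))
```

Under that assumption, the 2788-image data set comes out at roughly 6, 3, and 2 submodels for --split 500, 1000, and 1500.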