Bad zipfile error when using NodeODM GPU (race condition?)

I’ve now got a project where I can consistently replicate the “bad zipfile” error that causes a task to fail with “Cannot process dataset”.
I’ve heard this issue may be due to a race condition. I used to hit it only intermittently, but now I have a scenario where it has failed 4 times out of 4 attempts, always about 1 hour into the SfM processing stage, during image matching.
If I use the non-GPU NodeODM node instead, image matching completes and the task moves on to the next processing stage.

My system:
Ubuntu 20.04 LTS
HP ProLiant DL380p
Dual Intel Xeon 20-core CPUs
Nvidia K80 GPU (dual GPUs with 12GB RAM each)
1.5TB swap file
RAID (magnetic disks)

Data set:
Almost 7400 images at 20MPx (taken using a P4RTK)
90 GCPs

WebODM project task Settings:
auto-boundary: true, debug: true, dsm: true, dtm: true, smrf-threshold: 0.01, verbose: true

WebODM Developers - I can arrange remote access to my server if you want to investigate, just contact me at [email protected]

Seems the same as ODM issue #1506, “Fails when processing ~2gb worth of imagery” (OpenDroneMap/ODM on GitHub).


I’m not a software developer but I agree, it looks like the same error to me.
Let me know if there’s something I can do to be helpful.

Isolating a test case that can be reproduced quickly (without having to wait through 1 hour of processing) would probably help, if it’s possible. Does it happen with a small dataset (fewer than a few dozen images)?
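
If it helps with building a smaller test case, here is a rough sketch of one way to carve a small dataset out of a big one. It assumes the images are .jpg/.jpeg files sitting in a single folder, and that sorting by file name roughly follows flight order so consecutive shots still overlap. The function name make_subset is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of ODM/WebODM): copy the first COUNT images,
# sorted by file name, from SRC_DIR into DEST_DIR to build a small test dataset.
# Usage: make_subset SRC_DIR DEST_DIR COUNT
# Assumes file names contain no newlines.
make_subset() {
    src=$1
    dest=$2
    count=$3
    mkdir -p "$dest" || return 1
    # Sort by name so that consecutive (overlapping) shots stay together.
    find "$src" -maxdepth 1 -type f \( -iname '*.jpg' -o -iname '*.jpeg' \) \
        | sort \
        | head -n "$count" \
        | while IFS= read -r img; do
            cp "$img" "$dest/"
        done
}
```

For example, `make_subset /path/to/all/images /path/to/small/images 30` (both paths are placeholders) would give a 30-image folder to upload as a new task. Whether a subset that small still triggers the error is exactly the open question.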

Getting a copy of the core dump would also help.

You could get one by issuing (as root):

ulimit -c unlimited
echo "/code/core" > /proc/sys/kernel/core_pattern

Then run the process and wait for it to crash.

Without stopping the docker container, fetch the core dump files:

docker ps -a
<fetch the container ID ...>

docker cp <containerId>:/code/core ~/

This should copy the core dump into the user’s home directory.
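
Since core dumps from a large job can be big, it may be worth confirming the copy actually landed before stopping the container. A minimal sketch, assuming the dump was copied to a regular file on the host; the function name check_core is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: check that a copied core dump exists and is not empty,
# then print its size in kilobytes.
# Usage: check_core PATH
check_core() {
    core=$1
    # -s is true only if the file exists and has a size greater than zero.
    if [ ! -s "$core" ]; then
        echo "missing or empty: $core" >&2
        return 1
    fi
    size_kb=$(du -k "$core" | cut -f1)
    echo "found $core (${size_kb} KB)"
}
```

With the docker cp command above, this would be called as `check_core ~/core`.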

Btw, I just pushed a partial fix for this, which will let processing run to completion, but doesn’t solve the core issue (and makes processing slower than CPU processing when this error happens).

I’ve only been able to replicate it consistently with my biggest dataset (7000+ images), and intermittently with datasets of 1000+ images.
I’m having a go at generating a core dump and will upload it to the GitHub thread in an hour or so.
I’m assuming the NodeODM GPU container is the appropriate one…?

I’m sure I’ve seen it with small datasets too, so I’ll test a few to see if I can find one that triggers it.

@pierotofy - I’ve replicated the error and got a core dump. The core dump file is 19GB, so it’ll take a while to copy.
When it’s done I’ll put a download link for you in the GitHub thread.


@pierotofy - I’ve just posted a link to my core dump in the GitHub thread.
Let me know if it is helpful or if you need more data.

