Bad zipfile error when using NodeODM GPU (race condition?)

I’ve now got a project where I can consistently replicate the “bad zipfile” error that causes a task to fail with “Cannot process dataset”.
I’ve heard mention of this issue being due to a race condition. Previously I came across it only intermittently, but now I have a scenario where it has happened 4 times out of 4 attempts, always around 1 hour into the SfM stage, during image matching.
If I use the non-GPU NodeODM node, it completes image matching and moves on to the next processing stage.

My system:
Ubuntu 20.04 LTS
HP ProLiant DL380p
Dual Intel Xeon 20-core CPUs
512GB RAM
NVIDIA Tesla K80 (dual GPUs with 12 GB RAM each)
1.5 TB swap file
RAID (magnetic disks)

Data set:
Almost 7,400 images at 20 MP (taken with a P4RTK)
90 GCPs

WebODM project task settings:
auto-boundary: true, debug: true, dsm: true, dtm: true, smrf-threshold: 0.01, verbose: true

WebODM developers: I can arrange remote access to my server if you want to investigate; just contact me at [email protected]

Seems the same as “Fails opensfm_main.py when processing ~2gb worth of imagery” (OpenDroneMap/ODM issue #1506 on GitHub).

I’m not a software developer, but I agree; it looks like the same error to me.
Let me know if there’s something I can do to help.

Isolating a test case that can be reproduced quickly (without having to wait an hour of processing) would probably help, if that’s possible. Does it happen with a small dataset (fewer than a few dozen images)?
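If it helps to take WebODM out of the loop, a quick repro could run ODM’s GPU image directly against a small dataset. A minimal sketch, assuming the images sit in ./small_dataset/images and that the image tag is opendronemap/odm:gpu (both assumptions; adjust paths and tag to your setup):

# Mount the dataset at /datasets/code (ODM's expected layout) and run
# the GPU image end to end; a crash here should reproduce quickly.
docker run -ti --rm --gpus all \
    -v "$(pwd)/small_dataset:/datasets/code" \
    opendronemap/odm:gpu --project-path /datasets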

Getting a copy of the core dump would also help.

You could get one by issuing (as root):

ulimit -c unlimited                                 # remove the core file size limit
echo "/code/core" > /proc/sys/kernel/core_pattern   # write core dumps to /code/core
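One caveat if the node runs under Docker: a ulimit set in a host shell won’t reach the processes inside the container, whereas core_pattern is kernel-global and applies to containers too. A sketch of one way to handle this, assuming the NodeODM GPU image is tagged opendronemap/nodeodm:gpu (adjust image tag and port to your setup):

# On the host (as root): the kernel is shared with containers, and for a
# non-pipe pattern the core file is written in the crashing process's own
# filesystem, i.e. it lands at /code/core inside the container.
echo "/code/core" > /proc/sys/kernel/core_pattern

# Start the node with the core size limit lifted for its processes.
docker run -d -p 3000:3000 --gpus all --ulimit core=-1 opendronemap/nodeodm:gpu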

Run the process and wait for it to crash…

Without stopping the Docker container, fetch the core dump files:

docker ps -a                            # list containers and note the NodeODM container's ID
<fetch the container ID ...>

docker cp containerId:/code/core ~/     # copy the core dump out of the container

This should copy the core dump into the user’s home directory.
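For a scripted version of those two steps, something like this sketch should work (it assumes the image or container name contains “nodeodm”):

# Grab the ID of the first container whose image name matches "nodeodm",
# then copy the dump out of it.
CID=$(docker ps -a --format '{{.ID}} {{.Image}}' | grep nodeodm | head -n1 | awk '{print $1}')
docker cp "$CID":/code/core ~/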

Btw, I just pushed a partial fix for this, which lets processing run to completion but doesn’t solve the core issue (and makes processing slower than CPU processing when this error happens).

I’ve only been able to replicate it consistently with my biggest dataset (7,000+ images), and intermittently with 1,000+ image datasets.
I’m having a go at generating a core dump and will upload it to the GitHub thread in an hour or so.
I’m assuming the NodeODM GPU container is the appropriate one…?

I’m sure I’ve seen it with small datasets; I’ll test a few to see if I can find one.

@pierotofy - I’ve replicated the error and got a core dump. The core dump file is 19 GB, so it’ll take a while to copy.
When it’s done I’ll put a link for you to download it in the GitHub thread.
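Side note, in case it saves some transfer time: compressing the dump first usually shrinks it considerably. A sketch, assuming xz is installed (zstd would be faster if available):

# Multi-threaded compression of the dump before uploading.
xz -T0 ~/core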

@pierotofy - I just posted a link to my core dump in the GitHub thread.
Let me know if it’s helpful or if you need more data.
