I’m processing 5200 images on an EC2 instance with this specification:
- Ubuntu 20.04
- 32 vCPUs
- 128 GB of memory
- NVIDIA T4 Tensor Core GPU
I’ve tried running it with both the GPU and the regular ODM Docker images. In both cases the last message in the container log is
[INFO] Aligning submodels...
After that, the container exits with exit code 139. Its final state is
"State": {
    "Status": "exited",
    "Running": false,
    "Paused": false,
    "Restarting": false,
    "OOMKilled": false,
    "Dead": false,
    "Pid": 0,
    "ExitCode": 139,
    "Error": "",
    "StartedAt": "2021-07-27T08:04:25.465713163Z",
    "FinishedAt": "2021-07-27T11:12:40.418633498Z"
},
We can see that it was not OOM-killed. I monitored memory and GPU usage, and at the moment of failure there was plenty of free memory and the GPU was idle.
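As a sanity check on the exit code, here is a minimal sketch that decodes it, assuming the common convention that a process killed by signal N is reported as exit code 128 + N. The helper name `describe_exit_code` is my own, purely for illustration.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Describe an exit code, assuming the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            # signal.Signals maps a signal number to its symbolic name.
            return signal.Signals(signum).name
        except ValueError:
            return f"unknown signal {signum}"
    return "success" if code == 0 else f"exit status {code}"


print(describe_exit_code(139))
```

Under that assumption, 139 corresponds to signal 11 (SIGSEGV), i.e. a segmentation fault rather than an out-of-memory kill.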
The system log on the host contains this message:
Jul 27 11:12:21 ip-10-1-2-105 kernel: [12381.831006] python3[49809]: segfault at 53 ip 00007fdb4af87c9d sp 00007fffa3db67e0 error 4 in pymap.cpython-38-x86_64-linux-gnu.so[7fdb4af45000+9c000]
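To make that kernel line easier to read, here is a rough sketch that pulls out its fields and computes the offset of the faulting instruction inside the shared object (instruction pointer minus the mapping base). The regular expression and the `parse_segfault` helper are my own assumptions about the line layout, not part of ODM or any official tool.

```python
import re

# The relevant part of the kernel message quoted above.
LINE = (
    "python3[49809]: segfault at 53 ip 00007fdb4af87c9d sp 00007fffa3db67e0 "
    "error 4 in pymap.cpython-38-x86_64-linux-gnu.so[7fdb4af45000+9c000]"
)

# Assumed layout: "segfault at ADDR ip IP sp SP error CODE in LIB[BASE+SIZE]".
PATTERN = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line: str):
    """Return the decoded fields of a kernel segfault line, or None if it does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    return {
        "library": m["lib"],
        "fault_address": int(m["addr"], 16),
        "error_code": int(m["err"], 16),
        # Offset of the faulting instruction within the mapped library.
        "offset": ip - base,
    }


info = parse_segfault(LINE)
print(info["library"], hex(info["fault_address"]), hex(info["offset"]))
```

The faulting address 0x53 is very close to zero, which would be consistent with a null or near-null pointer dereference inside pymap, though that is only my reading of the log. If a build of the library with symbols is available, the computed offset could be looked up with a tool such as addr2line.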
As I said, I ran the processing with both the ODM and ODM-GPU Docker images, and both fail with the same error. I also tried setting it up on Ubuntu 18, with the same result.
My docker run command was:
docker run -d -v /mnt/data:/datasets/code --gpus all opendronemap/odm:gpu --project-path /datasets --pc-las --split 100 --split-overlap 10
Do you have any insights on what could have gone wrong?