I’m trying to sort out a few things around autoscaling on AWS. I finally have a node up and processing data, but I still have a few questions:
- What controls when nodes scale down? I left a job running overnight, and when I got up this morning the autoscaled node was still alive even though the job had completed. I have `maxRuntime` set to -1 in the config, but from the code and the comments in it, that setting looks like it only guards against stuck nodes, and I’m not sure whether it would kill a machine in the middle of processing a long-running job.
- Does autoscaling work with split-merge? Do I need to specify split parameters, or will ClusterODM handle it automatically?
- I am looking at modifying the code that launches docker-machine to remove the requirement for hardcoding credentials in the config. docker-machine seems to work fine with an instance profile; is there some other reason hardcoded credentials are required? Hardcoding is still useful when ClusterODM is running outside of AWS, but I’d like the option to use instance profiles instead.
- Since I’m in here messing around with the AWS autoscaling anyway, are there any other things on the back burner that I can help with? I’m already looking at extending some of the docker-machine options to make them a bit more flexible (an option to specify the SSH login user, a private network, a subnet, etc.). Full disclosure: I’m an AWS infrastructure and security guy, not a node developer.
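For reference on the first question, the relevant part of my autoscaler config looks roughly like this (abridged; the credential values and region are placeholders, and the key names are my best reading of the sample ClusterODM AWS config, so treat the exact schema as an assumption on my part):

```json
{
  "provider": "aws",
  "accessKey": "REDACTED",
  "secretKey": "REDACTED",
  "region": "us-west-2",
  "maxRuntime": -1,
  "maxUploadTime": -1
}
```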
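On the split-merge question: when I’ve run split-merge manually I’ve passed the standard split options along with the task, something like this against the NodeODM API (hostname, port, file names, and values are placeholders, and the endpoint/field shape is just my reading of the NodeODM docs):

```shell
# Illustrative only: submit a task with explicit split parameters.
curl -X POST http://localhost:3000/task/new \
  -F "images=@img1.jpg" -F "images=@img2.jpg" \
  -F 'options=[{"name":"split","value":400},{"name":"split-overlap","value":150}]'
```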
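And on the credentials question, this is roughly the invocation I have in mind: when the key flags are omitted, the amazonec2 driver falls back to the default AWS credential chain, which includes the EC2 instance profile when ClusterODM itself runs inside AWS. The region, instance type, security group, and machine name below are placeholders:

```shell
# No --amazonec2-access-key / --amazonec2-secret-key flags here:
# the driver should pick up credentials from the instance profile
# when ClusterODM is running on an EC2 instance.
docker-machine create --driver amazonec2 \
  --amazonec2-region us-west-2 \
  --amazonec2-instance-type t3.large \
  --amazonec2-security-group clusterodm-nodes \
  odm-autoscaled-node
```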