I’m trying to sort out a few things about autoscaling on AWS. I’ve finally got a node up and processing data, but I still have a few questions:
What controls when nodes scale down? I left a job running overnight, and when I got up this morning the autoscaled node was still alive even though the job had completed. I have maxRuntime set to -1 in the config, but from the code and its comments it looks like that only guards against stuck nodes, and I’m not sure whether it would kill a machine in the middle of processing a long-running job.
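For context, the relevant slice of my ASR provider config looks roughly like this (trimmed to the fields that seem related to scale-down; field names are as I understand them from the ClusterODM AWS docs, and the region/bucket/instance values are placeholders):

```json
{
  "provider": "aws",
  "region": "us-west-2",
  "s3": { "endpoint": "s3.us-west-2.amazonaws.com", "bucket": "my-odm-bucket" },
  "maxRuntime": -1,
  "maxUploadTime": -1,
  "imageSizeMapping": [
    { "maxImages": 40, "slug": "t3a.medium", "storage": 60 }
  ]
}
```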
Does autoscaling work with split-merge? Do I need to specify split parameters, or will ClusterODM handle that automatically?
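If I do need to pass them explicitly, I’d expect it to look something like this against ClusterODM’s NodeODM-compatible endpoint (port, filenames, and values are placeholders; `split` caps the number of images per submodel):

```bash
# Submit a task with explicit split parameters to the ClusterODM proxy port.
curl -X POST http://localhost:3000/task/new \
  -F "images=@img_0001.jpg" \
  -F "images=@img_0002.jpg" \
  -F 'options=[{"name":"split","value":"400"},{"name":"split-overlap","value":"150"}]'
```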
I am looking at modifying the code that launches docker-machine to remove the requirement for hardcoded credentials in the config. docker-machine seems to work fine with an instance profile, so is there some other reason that hardcoded credentials are required? Hardcoding is still useful when ClusterODM is running outside of AWS, but I’d like the option to use instance profiles instead.
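The change I have in mind is roughly this (a hypothetical sketch, not ClusterODM’s actual launch code; `buildCreateArgs` and the config shape are illustrative):

```javascript
// Sketch: only pass credential flags to docker-machine when they exist in
// the config; otherwise let the amazonec2 driver fall back to the standard
// AWS credential chain (env vars, shared credentials file, instance profile).
function buildCreateArgs(config, machineName) {
  const args = ["create", "--driver", "amazonec2"];

  // Credentials become optional instead of required.
  if (config.accessKey && config.secretKey) {
    args.push("--amazonec2-access-key", config.accessKey,
              "--amazonec2-secret-key", config.secretKey);
  }

  args.push("--amazonec2-region", config.region, machineName);
  return args;
}

// With no credentials in the config, docker-machine resolves them itself.
console.log(buildCreateArgs({ region: "us-west-2" }, "odm-node-1").join(" "));
```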
Since I’m in here messing around with the AWS autoscaling, are there any other things on the back burner that I can help with? I’m already looking at extending some of the docker-machine options to make things a bit more flexible (an option to specify the SSH login user, a private network, a subnet, etc.; see the sketch below). Full disclosure: I’m an AWS infrastructure and security guy, not a node developer.
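For the options I mentioned, these are the underlying amazonec2 driver flags I’d want to expose (all placeholder values):

```bash
# Standard docker-machine amazonec2 driver flags for SSH user,
# VPC/subnet placement, and private-only addressing.
docker-machine create --driver amazonec2 \
  --amazonec2-region us-west-2 \
  --amazonec2-vpc-id vpc-0123456789abcdef0 \
  --amazonec2-subnet-id subnet-0123456789abcdef0 \
  --amazonec2-private-address-only \
  --amazonec2-ssh-user ubuntu \
  odm-test-node
```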
Welcome! I have no experience with autoscaling, but it seems to be a growing part of the ecosystem, and AWS is certainly a key provider. Any help in there would be appreciated. Piero or Stephen can probably chime in with more specific guidance.
Some follow-up questions after playing around some more. I’m not sure if I’m missing something fundamental, but I don’t see any settings that control the following behavior:
Launching a new job always seems to spin up a new node, even when an idle node with enough capacity to run the job is already up. Why isn’t it placing tasks on existing nodes?
If you set a max lifetime on a node, it kills the node when the time is up, even while it’s running a job. The autoscaler needs either a separate “max idle time” option or a check for a running task before it kills the instance (see the sketch below).
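Something like this is what I have in mind (a hypothetical sketch, not ClusterODM’s actual code; the function and the node object’s shape are illustrative):

```javascript
// A node is only eligible for termination when it has no running task
// AND has been idle longer than a configurable threshold.
function shouldTerminate(node, maxIdleMs, now = Date.now()) {
  // Never reap a node that is still processing a task.
  if (node.runningTasks > 0) return false;

  // Only terminate once the node has been idle past the threshold.
  return (now - node.lastTaskCompletedAt) > maxIdleMs;
}

// Example: a node idle for 20 minutes with a 15-minute threshold gets reaped.
const idleNode = { runningTasks: 0, lastTaskCompletedAt: Date.now() - 20 * 60 * 1000 };
console.log(shouldTerminate(idleNode, 15 * 60 * 1000)); // true
```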
Split-merge only launches a single instance sized for the total number of images. I can’t get the job to actually start (see my other post re: invalid token), so I don’t know whether it launches more nodes later or just splits the job into chunks and runs them on the same instance. My reading of the docs suggested it should spread the load across multiple nodes.