Recently on a project we added Kubernetes Horizontal Pod Autoscalers (HPAs) to everything and gave them all a “default” configuration: scale based on some fairly arbitrary CPU utilization target (75% or so?).
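For concreteness, the kind of “default” we stamped out looked roughly like this (a sketch using the `autoscaling/v2` API; the names and replica counts are illustrative, not from a real service):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: some-service        # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: some-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75   # the arbitrary “default” target
```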
This was fine as long as nobody was paying attention to it, but eventually we noticed:
- Massive scaling on deploys
- Some services scaling up massively under certain kinds of load
So, after some investigation we found that:
- Spring Boot spikes CPU hard on boot (classpath scanning and other startup work), which skews the CPU figures the HPAs use and scales everything up
- One of the services was essentially running a brute-force solution to the knapsack problem, blocking the (single-threaded) Node.js event loop and scaling up massively
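For the boot-spike problem, one mitigation is the `behavior` section of an `autoscaling/v2` HPA, which lets you slow down the scale-up reaction so a short-lived startup spike doesn’t trigger a cascade of new pods (which themselves boot and spike). A hedged sketch, with illustrative names and numbers:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: spring-service     # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: spring-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 120  # wait out transient boot spikes
      policies:
        - type: Pods
          value: 2                     # add at most 2 pods per period
          periodSeconds: 60
```

The right window depends on how long your app’s startup spike actually lasts, which is exactly the kind of thing you only learn by watching it boot under observation.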
The “old” way of doing things is to “scale vertically”: over-provision CPU/RAM/disk to handle “peak load” (plus a bit of buffer), and not worry too much about the details until “peak” has increased.
With horizontal scaling, you need to think about:
- Startup times of “new” instances
- Enough “headroom” to take care of the increasing load until the new instances come online
- How quickly your load will increase (the rate of increase) and the shape of that increase (linear? exponential?)
- The relationship between request volume and response/processing time (effectively the big-O complexity of handling a request)
These are the basics. You don’t need exact numbers for them, but you do need a general idea of the above (and to observe behaviour during load tests and at peak times in production) to understand how your system behaves.
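It also helps to know the arithmetic the HPA actually does: utilization is measured relative to each container’s CPU *request*, and the replica count comes from a simple ratio. A worked example (the formula is from the Kubernetes HPA algorithm; the numbers are made up):

```
desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization)

# e.g. 4 replicas running at 90% of requested CPU, target 75%:
ceil(4 × 90 / 75) = ceil(4.8) = 5 replicas
```

This is why the “headroom” point above matters: a 75% target leaves you roughly 25% of requested CPU per pod to absorb load growth while the new pod starts up.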
The main goal of horizontal auto-scaling is to eliminate the “waste” inherent in provisioning for “peak”. In exchange it introduces some extra complexity, which means you really need to know your application, your load profile, and what sort of traffic you expect.