Understanding the Mystery of Linux System Load Averages
Load averages are a metric widely used for monitoring the health of a system, and quite a few organizations use them for capacity planning without fully understanding where the limits should be. Most admins are not sure how to read or respond to a system's load average; they simply go on the principle that low is GOOD and high is BAD…
The Linux operating system uses the load average to track running tasks as well as tasks in the uninterruptible sleep state. I am not exactly sure why it was decided to do it this way, but I will attempt to explain it in a fashion that makes it clear for everyone. Just looking at the name "load average", you can tell that it is an average, but an average of what?
Think of the system load average as the average number of running and waiting threads (which can also be considered tasks). Essentially, we are measuring demand on the system based on what is currently executing and what is waiting to be executed. Most *nix systems show this as a set of three averages, each displayed as a two-decimal-place float, calculated over 1-, 5-, and 15-minute intervals. Below you can see an example of this.
[root@test ~]# uptime
08:14:35 up 2 days, 21:16, 1 user, load average: 0.00, 0.01, 0.05
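If you want to pull those same three numbers programmatically instead of parsing uptime output, a minimal sketch using Python's standard library looks like this (on Linux, `os.getloadavg()` reports the same values the kernel exposes in /proc/loadavg):

```python
import os

# os.getloadavg() returns the 1-, 5-, and 15-minute load averages
# as a tuple of three floats -- the same values uptime prints.
one, five, fifteen = os.getloadavg()
print(f"load average: {one:.2f}, {five:.2f}, {fifteen:.2f}")
```

The exact numbers depend on what the machine is doing when you run it, which is why no expected output is shown here.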
Some basic takeaways:
- If the averages are all at 0.00, everything is running at idle.
- If the 1 min average is higher than the 5 and 15 min averages, your system load is increasing.
- If the 1 min average is lower than the 5 and 15 min averages, your system load is decreasing.
- If any of the load averages are higher than the total CPU count, you MAY have a performance issue.
- NOTE: While this can be true, it will not always be the case…
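The takeaways above can be sketched as a simple check; this is an illustration of the rules of thumb, not a definitive health check, and the thresholds are the ones stated in the list:

```python
import os

cores = os.cpu_count()                      # logical CPU count
one_min, five_min, fifteen_min = os.getloadavg()

# More runnable (or uninterruptibly sleeping) tasks than CPUs over the
# last minute MAY indicate a performance issue -- but not always.
if one_min > cores:
    print(f"1-min load {one_min:.2f} exceeds {cores} CPUs -- possible saturation")
elif one_min > five_min and one_min > fifteen_min:
    print("load is increasing")
else:
    print("load is steady or decreasing")
```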
While having all 3 of them is good, most setups that use load averages for system scaling and cloud scaling rely on only one of the 3 as a single value of system demand; without extra data for comparison, you cannot use load average alone. Take a "t3.nano" from AWS, which has 2 vCPUs and .5gb of RAM. If it has a load average of "4.29, 2.42, 1.23", you can tell that the system load is increasing. But when you also take into account that the 1 and 5 minute averages are over the count of cores available (2 vCPUs), it may be time to scale the workload out or upgrade to a larger instance.
While I think it is important reading, I am not going to go into the details of the history of load averages here. Instead, here is a link for you to go read on your own: RFC546 – https://tools.ietf.org/html/rfc546
Now it is time to contradict myself… While I told you above that the 1, 5 and 15 minute numbers are averages, they are not exactly 100% averages, and the "groups" are not really 1, 5 and 15 minute windows either. Those intervals are only used in the calculation of the returned metrics; the result is really an exponentially-damped moving sum of a 5 second average. As a consequence, the 1, 5, and 15 min load averages reflect load well beyond 1, 5 and 15 minutes.
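To make that concrete, here is a simplified floating-point sketch of the calculation (the kernel does the same thing in fixed point): every 5 seconds, each average decays by a constant factor and absorbs a fraction of the current task count.

```python
import math

# Per-tick decay factors: e^(-5s/60s), e^(-5s/300s), e^(-5s/900s),
# one for each reported average.
EXP_1  = math.exp(-5 / 60)
EXP_5  = math.exp(-5 / 300)
EXP_15 = math.exp(-5 / 900)

def update(avg, active, decay):
    """One 5-second tick of the exponentially-damped moving sum."""
    return avg * decay + active * (1 - decay)

# Simulate 5 minutes (60 ticks) of a constant 2 runnable tasks,
# starting from an idle system.
avg1 = avg5 = avg15 = 0.0
for _ in range(60):
    avg1  = update(avg1,  2, EXP_1)
    avg5  = update(avg5,  2, EXP_5)
    avg15 = update(avg15, 2, EXP_15)

print(f"{avg1:.2f}, {avg5:.2f}, {avg15:.2f}")  # → 1.99, 1.26, 0.57
```

Notice that after 5 full minutes at a steady demand of 2, the 1-minute value has nearly converged on 2 but the 15-minute value is still well below it, which is exactly why the reported numbers reflect load beyond their nominal windows.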
Another somewhat speculative aspect of this is why Linux adjusted things to bring in uninterruptible tasks. Most other *nix based OSes do not include those tasks in the calculation. I have seen many articles that speculate over why the Linux devs chose to do this, but no one really has the exact answer. I personally feel this was a way to measure more general system demand rather than CPU demand specifically.