Why CPU utilization doesn’t tell the whole story

A guide to right-sizing infrastructure in a mobile edge computing environment

Gene Bagwell, Senior Evolution Architect, Verizon

About this post

In the cloud computing literature, CPU utilization is one of the most popular infrastructure monitoring metrics and is widely available across the major cloud service providers’ management tools. However, in a mobile edge computing (MEC) environment, developers need to select infrastructure metrics carefully, both to ensure that infrastructure is sufficiently elastic and to ensure that teams aren’t overpaying for idle compute cycles.

Introduction

It’s 3 AM, and your operations manager just called. The platform is as slow as molasses in January, and subscribers are calling into the Network Repair Bureau (NRB) complaining. You roll out of bed, boot your laptop, log into the VPN and pull up the element management system (EMS) platform — and you see this:

CPU utilization demystified

CPU utilization is the measurement of “non-idle time,” or, more accurately, of the time the CPU is not running the idle thread. The kernel usually tracks this during context switches. If a non-idle thread begins, runs for 100 milliseconds, and stops, the kernel considers that CPU as utilized for that entire time, even if the processor spent much of it stalled, waiting for memory accesses to complete.
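To make the definition concrete, here is a minimal sketch of how a monitoring tool could derive %CPU on Linux by comparing two samples of the aggregate `cpu` line in /proc/stat. The function names are hypothetical and the sample values in the usage note are made up for illustration; real tools differ in detail.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counts."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    return [int(value) for value in fields[1:]]


def cpu_utilization(before, after):
    """Return the non-idle fraction of CPU time between two samples.

    Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    idle and iowait are counted as idle time; every other tick counts as busy,
    whether the CPU was retiring instructions or stalled on memory.
    """
    first, second = parse_cpu_line(before), parse_cpu_line(after)
    deltas = [b - a for a, b in zip(first, second)]
    total = sum(deltas[:8])  # guest time is already folded into user/nice
    idle = deltas[3] + deltas[4]
    if total <= 0:
        return 0.0
    return (total - idle) / total
```

For example, `cpu_utilization("cpu 1000 0 500 8000 500 0 0 0 0 0", "cpu 1600 0 700 8150 550 0 0 0 0 0")` reports 80% busy, yet nothing in that number says how many of the busy ticks did useful work.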

So what’s wrong with CPU utilization?

CPUs are faster than memory and have been for some time. The difference in CPU speed and memory speed is known as the von Neumann bottleneck.¹ Waiting on memory dominates what we call CPU utilization. When you see a high %CPU in a tool like top,² or on a graph, you probably think the processor is the bottleneck, when in reality it’s the DRAM sitting next to the CPU.

How to tell what your CPUs are doing

Performance Monitoring Counters (PMCs) are hardware counters that can be read using the Linux® perf command. Useful subcommands include:

  • perf record: record events for later reporting
  • perf report: break down events by process, function, etc.
  • perf annotate: annotate assembly or source code with event counts
  • perf top: see live event count
  • perf bench: run different kernel microbenchmarks
The perf tool is maintained in the Linux kernel source tree (tools/perf) and is typically installed from a distribution package such as linux-tools or perf.
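As a rough illustration of reading PMCs programmatically, the sketch below computes instructions per cycle (IPC) from the CSV output of `perf stat -x,`, a subcommand not listed above. The helper names are hypothetical, and `measure_ipc` assumes perf is installed and that the caller has permission to count events system-wide.

```python
import subprocess


def parse_perf_csv(text):
    """Map event name -> counter value from `perf stat -x,` CSV output."""
    counts = {}
    for line in text.splitlines():
        parts = line.strip().split(",")
        if len(parts) < 3 or line.startswith("#"):
            continue
        value = parts[0]
        event = parts[2].split(":")[0]  # drop modifiers such as ":u"
        if not value.isdigit():  # e.g. "<not supported>" or "<not counted>"
            continue
        counts[event] = int(value)
    return counts


def ipc_from_counts(counts):
    """Return instructions per cycle, or None if a counter is missing or zero."""
    cycles = counts.get("cycles", 0)
    instructions = counts.get("instructions", 0)
    if cycles <= 0 or instructions <= 0:
        return None
    return instructions / cycles


def measure_ipc(seconds=1):
    """Count cycles and instructions system-wide for `seconds` and return IPC."""
    cmd = ["perf", "stat", "-x,", "-e", "cycles,instructions",
           "-a", "sleep", str(seconds)]
    # perf stat writes its counter report to stderr by default.
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return ipc_from_counts(parse_perf_csv(result.stderr))
```

Keeping the parsing separate from the subprocess call means the arithmetic can be checked against captured output on a machine without perf.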

Interpretation and actionable items

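If the IPC (instructions per cycle) is < 1.0, you are likely memory stalled. You have two paths to look at:

  • Software tuning, such as reducing memory I/O and improving cache use and memory locality
  • Hardware tuning, including using processors with larger CPU caches and faster memory, busses and interconnects

If the IPC is > 1.0, you are likely instruction bound. For hardware tuning, try a faster clock rate and more cores/hyperthreads.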
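The rule of thumb above can be written as a small helper. This is only a sketch: the function name is hypothetical, the 1.0 threshold is the heuristic from this post rather than a hard limit, and the right cutoff depends on the processor and the workload.

```python
def interpret_ipc(ipc, threshold=1.0):
    """Classify an IPC reading using the rule of thumb from this post.

    Returns a (diagnosis, hardware_hint) tuple. The threshold is a heuristic,
    not a property of any particular CPU.
    """
    if ipc <= 0:
        raise ValueError("IPC must be positive")
    if ipc < threshold:
        return ("likely memory stalled",
                "larger CPU caches, faster memory, busses and interconnects")
    if ipc > threshold:
        return ("likely instruction bound",
                "faster clock rate, more cores/hyperthreads")
    return ("inconclusive", "gather more samples before tuning")
```

A reading of 0.75, for instance, would be reported as likely memory stalled, which points toward caches and memory rather than raw clock speed.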

Why this matters for 5G Edge deployments

Verizon 5G Edge provides developers with access to Amazon Web Services® (AWS®) compute (Amazon® Elastic Compute Cloud™ [EC2®], Elastic Container Service [ECS] and Elastic Kubernetes Service [EKS]) and storage (Amazon Elastic Block Store [EBS]) topologically closer to the end user than ever before. However, developers must also be mindful that with great power — an Availability Zone within the edge of the mobile network — comes great responsibility to maintain reliability. And, more importantly, to maintain cost efficiency. To get the most out of your investment in 5G Edge infrastructure, it’s important to rightsize your infrastructure accordingly. Said differently, if you have an AWS Auto Scaling trigger based on an underlying metric that is misrepresenting the current state of your compute, would you want to continue using that metric?

Practical advice

Amazon CloudWatch, which natively supports infrastructure in all Wavelength Zones, provides a rich set of tools to monitor the health and resource utilization of various AWS services. CloudWatch supports custom metrics, which are user-defined data points that can be used to set up alarms, send notifications and trigger actions when those alarms fire. Using the native perf tool in Linux 4.X, developers could publish custom metrics such as IPC to CloudWatch (using the PutMetricData API) rather than relying only on off-the-shelf metrics such as CPU Utilization (%).
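A minimal sketch of publishing such a custom metric follows, assuming the boto3 SDK is installed and AWS credentials are configured. The namespace `Custom/Perf`, the metric name `InstructionsPerCycle` and both function names are assumptions chosen for illustration, not names defined by AWS or by this post.

```python
def build_ipc_metric(ipc, instance_id, namespace="Custom/Perf"):
    """Build keyword arguments for the CloudWatch PutMetricData call.

    The namespace and metric name are illustrative; pick your own.
    """
    return {
        "Namespace": namespace,
        "MetricData": [
            {
                "MetricName": "InstructionsPerCycle",
                "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                "Value": float(ipc),
                "Unit": "None",  # IPC is a dimensionless ratio
            }
        ],
    }


def publish_ipc(ipc, instance_id, region_name=None):
    """Send one IPC data point to CloudWatch (requires boto3 and credentials)."""
    import boto3  # imported lazily so the payload builder works without the SDK

    client = boto3.client("cloudwatch", region_name=region_name)
    client.put_metric_data(**build_ipc_metric(ipc, instance_id))
```

Once the data points arrive, an alarm on this metric could sit alongside, or replace, a CPU Utilization (%) alarm as an Auto Scaling trigger.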

Conclusion

Through our foray into the history of the CPU-DRAM gap, perf, and much more, I hope you’ve learned not to be fooled by CPU utilization. Just to recap our top three pieces of advice:

  • CPU utilization is a misleading metric: It includes cycles waiting on memory, disk and network I/O. Waiting on memory I/O can dominate modern workloads
  • You can figure out what your %CPU means by gathering additional metrics, like instructions per cycle (IPC), from the Linux perf command
  • Remember, an IPC < 1.0 likely means memory bound, and an IPC > 1.0 likely means instruction bound
