Gene Bagwell, Senior Evolution Architect, Verizon
Robert Belson, Corporate Strategy, Verizon
About this post
In the cloud computing literature, CPU utilization is one of the most popular infrastructure monitoring metrics and is widely available across the major cloud service providers’ management tools. In a mobile edge computing (MEC) environment, however, carefully selected infrastructure metrics are of paramount importance to developers: they determine whether infrastructure is sufficiently elastic and whether teams are overpaying for idle compute cycles.
In this post, we’ll unearth one of the most popular misconceptions in infrastructure monitoring and offer a series of recommendations for how you can best monitor your 5G Edge applications today.
It’s 3 AM, and your operations manager just called. The platform is as slow as molasses in January, and subscribers are calling into the Network Repair Bureau (NRB) complaining. You roll out of bed, boot your laptop, log into the VPN and pull up the element management system (EMS) platform — and you see this:
What do you think the 90-plus-percent CPU utilization in the graph above represents? How busy your processors are?
Nope, that’s not what it measures. And yes, I’m talking about “%CPU.” This metric that we all use for CPU utilization is misleading.
So what is CPU utilization — really?
Does it mean this?
Or does it mean this?
Stalled means the CPUs are not processing instructions. They are stalled waiting on I/O — generally memory, but sometimes network or storage operations to complete. Understanding what your CPUs are doing, and when they are stalling, can direct system performance tuning and troubleshooting and help improve code efficiency. Anyone interested in CPU performance will benefit from knowing what the stalled components that make up %CPU are.
CPU utilization demystified
CPU utilization is the measurement of “non-idle time.” Or, more accurately, of when the CPU is not running the idle thread. The kernel usually tracks this during context switches. If a non-idle thread begins, runs for 100 milliseconds, and stops, the kernel considers that CPU as utilized for that entire time, even if it was waiting for some I/O operation to complete.
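To make that accounting concrete, here is a toy Python sketch (all numbers invented) of how “non-idle time” lumps memory-stall time in with genuinely busy time:

```python
# Toy timeline (all numbers invented): each entry is (milliseconds, state).
# The kernel's utilization accounting only distinguishes idle vs. non-idle,
# so time stalled on memory is counted as "busy."
timeline = [
    (40, "retiring"),   # actually completing instructions
    (60, "stalled"),    # waiting on DRAM; still non-idle to the kernel
    (100, "idle"),      # the idle thread is running
]

total = sum(ms for ms, _ in timeline)
non_idle = sum(ms for ms, state in timeline if state != "idle")
truly_busy = sum(ms for ms, state in timeline if state == "retiring")

print(f"reported %CPU: {non_idle / total:.0%}")   # reported %CPU: 50%
print(f"truly busy:    {truly_busy / total:.0%}")  # truly busy:    20%
```

In this toy example, a monitoring tool would report 50% CPU utilization even though the processor spent more time stalled than retiring instructions.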
So what’s wrong with CPU utilization?
CPUs are faster than memory and have been for some time. The difference in CPU speed and memory speed is known as the von Neumann bottleneck.¹ Waiting on memory dominates what we call CPU utilization. When you see a high %CPU in a tool like top,² or on a graph, you probably think the processor is the bottleneck, when in reality it’s the DRAM sitting next to the CPU.
For 40 years, processor manufacturers have scaled their clock speeds faster than DRAM manufacturers could reduce memory access latency, hence the “CPU-DRAM gap.”³
Around 2005, this changed: manufacturers began scaling processors by adding multiple cores, hyperthreads and multisocket configurations, all of which puts more demand on the memory subsystem. To compensate, CPU manufacturers added larger and smarter CPU caches, and motherboard manufacturers created faster memory buses and interconnects. However, it’s all been for naught, as we are usually still memory stalled.
How to tell what your CPUs are doing
Performance Monitoring Counters (PMCs) are hardware counters that can be read using the Linux® perf command.
Access to the PMCs is now integrated into the Linux kernel, assuming you are running a recent Linux distribution, and can be used to collect and analyze performance data about your application or system. Using the perf command requires installing the perf package for your Linux distribution; it’s one of many tools that really should be present on every Linux system.
The userspace perf command presents a simple-to-use interface with commands like:
- perf stat: obtain event counts
- perf record: record events for later reporting
- perf report: break down events by process, function, etc.
- perf annotate: annotate assembly or source code with event counts
- perf top: see live event counts
- perf bench: run different kernel microbenchmarks
Here is an example of perf measuring the entire system for 10 seconds.
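As a rough sketch, the snippet below imitates the counter output of `perf stat -a sleep 10` (the sample text and counter values are invented, and real perf output formatting varies by version) and shows how IPC is derived from the raw counts:

```python
import re

# Invented sample in the style of `perf stat -a sleep 10` output;
# real numbers and formatting will differ per system and perf version.
SAMPLE = """\
 Performance counter stats for 'system wide':

    80,112,345,678      cycles
    62,487,629,629      instructions              #    0.78  insns per cycle

      10.001234567 seconds time elapsed
"""

def ipc_from_perf_stat(text: str) -> float:
    """Compute instructions per cycle from perf stat text output."""
    def count(event: str) -> int:
        m = re.search(r"([\d,]+)\s+" + event, text)
        return int(m.group(1).replace(",", ""))
    return count("instructions") / count("cycles")

print(round(ipc_from_perf_stat(SAMPLE), 2))  # 0.78
```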
The key metric here is instructions per cycle (“insns per cycle”), which is also known as IPC. The IPC value shows the average number of instructions completed per CPU clock cycle. Generally speaking, a higher IPC indicates better performance.
The above example of 0.78 (78%) sounds busy — until you realize that this processor’s top speed is an IPC of 4.0. That’s because the processor shown in this example is an Intel® Xeon®, which is a 4-wide architecture. A 4-wide superscalar CPU architecture refers to the number of instruction fetch/decode paths on the CPU, which means that our example CPU can retire/complete four instructions every clock cycle. So an IPC of 0.78 on a 4-wide system means the CPU is running at ~19.5% of its top speed. Not what you want to see if you are trying to squeeze the best performance out of an application.
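The arithmetic above is simply the ratio of achieved IPC to the core’s width; as a tiny sketch:

```python
def pct_of_top_speed(ipc: float, width: int = 4) -> float:
    """Fraction of peak achieved on a `width`-wide superscalar core."""
    return ipc / width

# An IPC of 0.78 on a 4-wide CPU:
print(f"{pct_of_top_speed(0.78):.1%}")  # 19.5%
```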
Interpretation and actionable items
If the IPC is < 1.0, you are likely memory stalled. You have two paths to look at:
- Software tuning strategies, including reducing memory I/O and improving CPU caching and memory locality, especially on nonuniform memory access (NUMA) systems
- Hardware tuning, including using processors with larger CPU caches and faster memory, buses and interconnects
If your IPC is > 1.0, you are likely instruction bound. To troubleshoot:
- Look for ways to reduce code execution to eliminate unnecessary work, cache operations, etc. CPU flame graphs are an excellent tool for this investigation
- For hardware tuning, try a faster clock rate and more cores/hyperthreads
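The two branches above condense into a rule-of-thumb helper (the 1.0 threshold is a heuristic, not a hard law, and the advice strings are only a summary of the guidance in this post):

```python
def diagnose(ipc: float) -> str:
    """Rough triage based on instructions per cycle (heuristic only)."""
    if ipc < 1.0:
        return ("likely memory stalled: reduce memory I/O, improve caching "
                "and NUMA locality, or use larger caches and faster memory")
    return ("likely instruction bound: eliminate unnecessary work (CPU flame "
            "graphs help here), or add clock speed and cores/hyperthreads")

print(diagnose(0.78))
print(diagnose(2.40))
```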
Why this matters for 5G Edge deployments
Verizon 5G Edge provides developers with access to Amazon Web Services® (AWS®) compute (Amazon® Elastic Compute Cloud™ [EC2®], Elastic Container Service [ECS] and Elastic Kubernetes Service [EKS]) and storage (Amazon Elastic Block Store [EBS]) topologically closer to the end user than ever before. However, developers must also be mindful that with great power — an Availability Zone within the edge of the mobile network — comes great responsibility to maintain reliability. And, more importantly, to maintain cost efficiency. To get the most out of your investment in 5G Edge infrastructure, it’s important to rightsize your infrastructure accordingly. Said differently, if you have an AWS Auto Scaling trigger based on an underlying metric that is misrepresenting the current state of your compute, would you want to continue using that metric?
AWS CloudWatch, which natively supports infrastructure in all Wavelength Zones, provides a rich set of tools to monitor the health and resource utilization of various AWS services. CloudWatch supports custom metrics: user-defined data points that are collected and, in turn, used to set up alarms, send notifications and trigger actions when alarms fire. Using the native perf subsystem in Linux 4.x, developers could publish custom metrics to CloudWatch (using the PutMetricData API) rather than rely on off-the-shelf metrics such as CPU Utilization (%).⁴
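As a sketch of that idea, the namespace, metric name and dimension below are illustrative choices (not prescribed by AWS), and actually publishing requires boto3 and valid AWS credentials:

```python
def build_ipc_metric(ipc: float, instance_id: str) -> dict:
    """Build PutMetricData keyword arguments carrying IPC instead of %CPU.
    Namespace, metric and dimension names here are illustrative."""
    return {
        "Namespace": "EdgeApp",
        "MetricData": [{
            "MetricName": "InstructionsPerCycle",
            "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
            "Value": ipc,
            "Unit": "None",
        }],
    }

def publish_ipc(ipc: float, instance_id: str) -> None:
    import boto3  # assumed available; the call needs AWS credentials
    cw = boto3.client("cloudwatch")
    cw.put_metric_data(**build_ipc_metric(ipc, instance_id))
```

An alarm on this custom metric could then drive scaling decisions instead of the misleading %CPU.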
As an example, to meet unpredictable increases in traffic demand, we can add more instances to the Auto Scaling group using our custom-defined scaling policies and let our load balancer take care of distributing the traffic across the instance fleet. This results not only in an overall reduction in per-instance load, but also in a more effectively rightsized infrastructure.
Outside of CloudWatch, CodeGuru Profiler can also help optimize your application logic by identifying long-running lines of code and showing how your code is “spending its time” (i.e., the distribution of thread states).
Through our foray into the history of the CPU-DRAM gap, perf, and much more, I hope you’ve learned not to be fooled by CPU utilization. Just to recap our top three pieces of advice:
- The performance-monitoring products that show %CPU, which is all of them, really should show PMC metrics to better explain what is going on
- CPU utilization is a misleading metric: It includes cycles waiting on memory, disk and network I/O. Waiting on memory I/O can dominate modern workloads. You can figure out what your %CPU means by gathering additional metrics, like instructions per cycle, from the Linux perf command
- Remember, an IPC < 1.0 likely means memory bound, and an IPC > 1.0 likely means instruction bound
If you follow these words of wisdom, you may never get that 3 AM call again. (We hope.)
¹ Von Neumann bottleneck: https://en.wikipedia.org/wiki/Von_Neumann_architecture#Von_Neumann_bottleneck
³ CPU-DRAM gap: https://pdfs.semanticscholar.org/6ebe/c8701893a6770eb0e19a0d4a732852c86256.pdf
⁴ Note: vPMU is only available on EC2 today for dedicated hosts.