A Game of Tetris: Tips and Tricks to Deploying Your Distributed App
A deep dive into Verizon’s approach to virtualization and practical advice for your first distributed app on 5G Edge.
By Brent Segner, Network Infrastructure Planning, Verizon
With the exponential increase in cloud computing over the last decade, one of the problems that has emerged is the efficient placement of workloads. This problem is compounded when you attempt to place workloads with greatly diversified resource requirements into environments with limited capacity over different geographies. Workload placement is a multidimensional problem: resource dimensions such as CPU usage, memory and network bandwidth must be weighed in conjunction with external constraints, including latency and site availability. While this problem is by no means completely solved, this article explores it at a deeper level and offers some best practices for dealing with it from an architecture perspective.
Verizon’s approach to virtualization
Few companies have benefited from the evolution to cloud computing and virtualization to the degree that Verizon has. Working closely with industry partners, Verizon has spent the last six years virtualizing hundreds of its network functions onto an internally developed and managed cloud infrastructure. The Verizon Cloud Platform (VCP) infrastructure is architected as a high-availability, low-latency environment that currently spans more than 100 discrete locations across a global footprint.
The process of migrating Verizon’s 3G and 4G network elements from the traditional physical network functions (PNFs) that existed on dedicated, purpose-built hardware to virtualized network functions (VNFs) that now exist on shared commodity hardware has enabled Verizon to realize millions of dollars in both capital and operating expense savings. These savings come from several different factors, but none is more impactful than the ability to efficiently offer a virtual pool of workload-agnostic resources (vCPU, RAM, Storage).
The diversity of network functions within the Verizon environment brings a wide variety of resource and performance requirements. To accommodate the different application performance needs, the VCP infrastructure is segmented into different geographic and performance-based tiers that can dynamically grow or shrink as needed. While this segmentation of infrastructure into different tiers solves the immediate business need, it also brings a more significant problem into focus: workload placement.
For an application to function as designed, it must be placed into an environment with sufficient virtual resources available (CPU, RAM, Storage), while also accounting for its latency requirements with adjacent network functions. This makes workload placement critically important to running an application as efficiently as possible. As the number of applications running within the cloud environment increases, the process of placing workloads into an appropriate sub-cloud that meets both latency and performance requirements can start to resemble a game of Tetris.
Why is this like a game of Tetris?
In the mid-1980s, mathematician Alexey Pazhitnov created a game where geometric pieces known as tetrominoes are randomly generated and placed within a rectangular board. Tetris, as it became known, was an international sensation by the late 1980s, with players worldwide attempting to pack their grid as tightly as possible with varying, odd-shaped objects. There is a finite amount of space available, and the game becomes more difficult as the speed at which the tetrominoes need to be placed increases and the area the player has to work with decreases. Players reveled in the challenge.
Similar to a game of Tetris, there are finite limits to the available space for deploying a virtual workload in a cloud environment. Rather than placing applications in a computer-generated 20-by-10 game board, the constraints in a cloud environment are dictated by the number of locations, the volume of virtualized resources exposed from the physical hardware and the network latency between regions. Just as players sought out the challenge of finding an optimum way to pack the Tetris game board, Verizon engineers continually look for the optimal way to place workloads into various VCP locations or geographic regions across the globe. Workload placement in a dynamic shared environment is no simple task: specified performance requirements, dependencies, resource needs and the affinity rules that ultimately define the resilience of an application must all be considered.
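The Tetris analogy corresponds to what is generally known as multidimensional bin packing. The following is a minimal sketch, assuming a simple first-fit-decreasing heuristic over two resource dimensions; the site names, capacities and workloads are hypothetical, and this is not a description of how VCP places workloads.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    name: str
    vcpu: int      # free vCPUs remaining at this (hypothetical) site
    ram_gb: int    # free RAM remaining, in GB
    placed: list = field(default_factory=list)

    def fits(self, w):
        return w["vcpu"] <= self.vcpu and w["ram_gb"] <= self.ram_gb

    def place(self, w):
        self.vcpu -= w["vcpu"]
        self.ram_gb -= w["ram_gb"]
        self.placed.append(w["name"])


def first_fit_decreasing(workloads, sites):
    """Place the largest workloads first; return names that fit nowhere."""
    unplaced = []
    # Sort by vCPU, then RAM, largest first.
    ordered = sorted(workloads, key=lambda w: (w["vcpu"], w["ram_gb"]), reverse=True)
    for w in ordered:
        target = next((s for s in sites if s.fits(w)), None)
        if target is None:
            unplaced.append(w["name"])
        else:
            target.place(w)
    return unplaced


sites = [Site("edge-a", vcpu=8, ram_gb=32), Site("edge-b", vcpu=8, ram_gb=16)]
workloads = [
    {"name": "cache", "vcpu": 2, "ram_gb": 16},
    {"name": "video", "vcpu": 6, "ram_gb": 12},
    {"name": "api", "vcpu": 4, "ram_gb": 8},
    {"name": "batch", "vcpu": 4, "ram_gb": 8},
]
unplaced = first_fit_decreasing(workloads, sites)
for site in sites:
    print(site.name, site.placed)
```

A heuristic like this only considers capacity. Latency, dependencies and affinity rules add further constraints on top of the packing problem.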
Where are the challenges when deploying to 100-plus DCs?
While playing this game of workload-placement Tetris on a single board is challenging, playing it on a hundred different boards simultaneously requires the entire workload placement process to be reimagined. That is exactly the situation VCP engineers have found themselves in as the wireless network transitions to virtualized network functions running on cloud infrastructure. When VCP started as a small private cloud in a couple of “core” data centers, the engineers had to manage workload placement for a small number of applications in just these few locations, making it a relatively simple task. But as VCP has matured, the types and volume of workloads looking to virtualize have grown with it. This has produced a steady push to move the cloud ever closer to the edge of the network, resulting in more than 200 domestic and international sub-clouds, which will eventually evolve into tens of thousands of additional clouds to support 5G.
At this scale, it would take an army of engineers to maintain the ongoing analysis needed to ensure that VCP has the resources to meet the placement and performance requirements of hundreds of applications across potentially thousands of VCP locations. Since the operational costs and inherent delays associated with manual infrastructure and workload management would be unsustainable, the VCP team has leveraged artificial intelligence (AI) and machine learning (ML) to optimize workload placement within the environment. The AI/ML work Verizon has done in this area creates rule-based automata using custom algorithms that determine optimal workload placement within each of the various sub-cloud environments.
These algorithms take into account a number of factors relevant to both the infrastructure and the application. Examples of infrastructure factors include geographic placement, latency, and the number and type of resources available at a particular location (CPU, GPU, etc.). These infrastructure factors are then weighed against application requirements (adjacent services, latency tolerance, resource requirements and redundancy models) to determine an optimal placement for the service.
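To make the interplay of infrastructure factors and application requirements concrete, here is a minimal filter-then-score sketch. It is an illustration only: the `Location` and `AppRequirements` types, the location names and the tight-fit scoring are assumptions made for this example, not Verizon's algorithms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    name: str
    free_vcpu: int
    free_ram_gb: int
    has_gpu: bool
    latency_ms: float  # hypothetical latency to the adjacent service


@dataclass(frozen=True)
class AppRequirements:
    vcpu: int          # assumed to be greater than zero
    ram_gb: int        # assumed to be greater than zero
    needs_gpu: bool
    max_latency_ms: float


def feasible(loc, app):
    """Hard constraints: resources, hardware type and latency tolerance."""
    return (
        loc.free_vcpu >= app.vcpu
        and loc.free_ram_gb >= app.ram_gb
        and (loc.has_gpu or not app.needs_gpu)
        and loc.latency_ms <= app.max_latency_ms
    )


def score(loc, app):
    """Lower is better: prefer the tightest fit (illustrative weighting)."""
    leftover_cpu = (loc.free_vcpu - app.vcpu) / loc.free_vcpu
    leftover_ram = (loc.free_ram_gb - app.ram_gb) / loc.free_ram_gb
    return leftover_cpu + leftover_ram


def choose_location(locations, app):
    """Return the best feasible location, or None if nothing qualifies."""
    candidates = [loc for loc in locations if feasible(loc, app)]
    return min(candidates, key=lambda loc: score(loc, app), default=None)


locations = [
    Location("core-east", 64, 256, True, 38.0),
    Location("edge-nyc", 8, 32, False, 4.0),
    Location("edge-bos", 16, 64, True, 9.0),
]
gpu_app = AppRequirements(vcpu=4, ram_gb=16, needs_gpu=True, max_latency_ms=10.0)
cpu_app = AppRequirements(vcpu=4, ram_gb=16, needs_gpu=False, max_latency_ms=10.0)
print(choose_location(locations, gpu_app).name)
print(choose_location(locations, cpu_app).name)
```

Separating hard constraints from a soft score keeps the rules easy to extend: a new requirement is either another condition in `feasible` or another term in `score`.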
The algorithms continually work to keep the environment optimized by suggesting how to shift applications around based on workloads spinning up or down on the platform. While the use of AI/ML does not completely remove the need for manual oversight, it does enable Verizon to manage the workloads with significantly less hands-on engineering. As Verizon’s use of AI/ML continues to mature for this use case, the overall process will continue to become more targeted and efficient. Early results have shown great promise: AI/ML recommendations on workload placement have been equal to or better than human outputs in the majority of scenarios, and have been produced in a mere fraction of the time.
Playing Tetris in 5G Edge
As Verizon continues to build out its 5G network, the need for a data-driven approach to managing the utilization of virtual resources will only continue to grow. The 5G network will introduce a whole new scope and scale of considerations, including tens of thousands of sub-clouds, network slicing and multiple Wavelength Zones, as well as third-party developers leveraging the network through mobile edge computing (MEC). To keep pace with the demand, it will become more critical than ever to combine the engineering teams’ domain expertise with AI/ML to optimize and maintain the rule-based systems that manage the workloads. Given the speed of technological change, the systems that manage the infrastructure and workloads will need to continue to evolve. It is of paramount importance to maintain the flexibility to handle the yet unknown, as new schemas, rules and considerations are introduced.
Advice from practical experience
With the introduction of MEC, the placement of workloads within a cloud environment becomes a challenge that extends beyond the traditional telecom engineering teams. Improving application performance and stability while determining the optimal geographic placement for workloads will increasingly fall to traditional application development teams as well.
Understand requirements
It is important for developers who want to start deploying applications into edge locations to first develop a clear understanding of their performance requirements, down to the most granular level possible. Once there is an understanding of task-level latency tolerance, resource needs (CPU, RAM, Storage), geographic requirements and resiliency model, the infrastructure can be closely evaluated to determine how those needs can best be met. Depending on the complexity of the application deployment at scale, harnessing AI/ML toolsets is a good option for developing new ways to move tasks and workloads throughout the environment to reduce cost and improve application performance.
Exercise restraint on affinity/anti-affinity rules
Years of virtualization experience within Verizon have taught us to use some caution when establishing affinity/anti-affinity rules from application requirements. While these rules are a tremendous tool for ensuring a higher level of performance and stability within an application, they should be used with restraint. Setting arbitrary rules on where the scheduler can place workloads has the potential to limit the application’s ability to scale and provide the critical resources necessary to meet demand. In a centralized cloud environment where there are numerous hosts to schedule on, this does not pose as large a challenge, but as the workloads are pushed closer to the edge where there are fewer options, the restrictions can quickly become problematic.
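The effect of a strict rule at a small site can be illustrated with some simple arithmetic. The sketch below assumes a hard anti-affinity rule of at most one replica per host, considers only vCPU capacity, and uses invented host counts; the `schedulable_replicas` helper is hypothetical and not part of any scheduler.

```python
def schedulable_replicas(host_free_vcpu, replica_vcpu, wanted, anti_affinity):
    """Return how many of `wanted` replicas can be placed.

    host_free_vcpu: free vCPUs per host (hypothetical numbers)
    anti_affinity:  if True, allow at most one replica per host
    """
    placed = 0
    for free in host_free_vcpu:
        slots = free // replica_vcpu
        if anti_affinity:
            slots = min(slots, 1)
        placed += slots
    return min(placed, wanted)


core_hosts = [16] * 20  # a large central site with many hosts
edge_hosts = [16] * 3   # a small edge site with few hosts

for label, hosts in (("core", core_hosts), ("edge", edge_hosts)):
    strict = schedulable_replicas(hosts, 4, wanted=6, anti_affinity=True)
    relaxed = schedulable_replicas(hosts, 4, wanted=6, anti_affinity=False)
    print(f"{label}: strict={strict} relaxed={relaxed}")
```

With these numbers, the core site can place all six replicas either way, while the edge site can place only three under the strict rule even though it has capacity for six.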
Know when and how to scale
When constructing your application, it is essential to consider three workload scenarios as you are defining the autoscaling requirements. The scenarios are:
- Predictable bursting workload pattern
- Unpredictable bursting workload pattern
- On-and-off workload pattern
A common autoscaling practice is to employ fixed, infrastructure-level, CPU-based rules that scale the number of container instances allocated to a specific service up or down depending on demand. Although these reactive, fixed-rule autoscaling methods may be appropriate for some legacy applications, they can result in undesirable QoS or poor resource utilization, either by scaling unnecessarily or by failing to scale when needed. To ensure a consistent user experience without incurring unnecessary costs, it is crucial that the design incorporates appropriate triggers to scale in and scale out for each scenario.
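To see how a single fixed rule behaves differently across workload patterns, consider the small simulation below. It is a sketch under stated assumptions: the `ScalingPolicy` fields, the thresholds, the per-instance capacity and the load trace are all invented for illustration and do not correspond to any particular autoscaler's API.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    scale_out_above: float  # CPU fraction that triggers adding an instance
    scale_in_below: float   # CPU fraction that triggers removing an instance
    cooldown_ticks: int     # ticks to wait after any scaling action
    min_instances: int = 1
    max_instances: int = 10


def simulate(policy, load_per_tick, capacity_per_instance=100.0):
    """Replay a load trace and return the instance count after each tick."""
    instances = policy.min_instances
    cooldown = 0
    history = []
    for load in load_per_tick:
        utilization = load / (instances * capacity_per_instance)
        if cooldown > 0:
            cooldown -= 1  # still waiting out the previous action
        elif utilization > policy.scale_out_above and instances < policy.max_instances:
            instances += 1
            cooldown = policy.cooldown_ticks
        elif utilization < policy.scale_in_below and instances > policy.min_instances:
            instances -= 1
            cooldown = policy.cooldown_ticks
        history.append(instances)
    return history


# A hypothetical on-and-off trace: idle, a sudden burst, then nothing.
on_off_trace = [50, 250, 250, 250, 0, 0, 0]

eager = ScalingPolicy(scale_out_above=0.8, scale_in_below=0.3, cooldown_ticks=0)
cautious = ScalingPolicy(scale_out_above=0.8, scale_in_below=0.3, cooldown_ticks=2)

print(simulate(eager, on_off_trace))     # [1, 2, 3, 4, 3, 2, 1]
print(simulate(cautious, on_off_trace))  # [1, 2, 2, 2, 1, 1, 1]
```

In this trace the cautious policy never gets past two instances during the burst, while the eager one follows the load closely but would oscillate on a noisier signal. Neither setting is right for every pattern, which is why the triggers should be chosen per scenario.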
Conclusion
The ability to develop applications for use within an edge environment presents an unparalleled opportunity to create new services, as well as immersive user experiences. To do this effectively, it is important to understand both the resource and performance requirements of the application and the environment it is being placed into. As applications start to look beyond traditional VM-based virtualization toward containerized deployment, additional tools and capabilities become available to help optimize workload placement and ensure consistent performance. Ultimately, adopting this more modular and dynamic container-based infrastructure will make it possible to address changing workload intensity over time, reducing both expensive over-provisioning and the poor performance that results from under-provisioning.
In the next blog, we will discuss why CPU utilization is not always the best metric for triggering scaling, as well as some more appropriate alternatives.