DevOps & Cloud
Designing stateful applications in a 5G Edge world
Best practices for building 5G Edge applications that enable the next generation of disruptive digital experiences.
By Chafik Taiebennefs, 5G Edge Professional Services, Verizon
Life happens at the edge.
Life, in the real world, is a culmination of experiences that happen in the here and now. In the digital world, here and now are defined by the real-time understanding of our location, our interactions and the events around us. We exist in a densely digitized “edge,” where everything has a processor, is connected and produces a constant stream of data that captures our every digital move.
Living such hyperconnected lives raises the bar on the type of digital experiences we have — and will — come to expect. Our lives, which are increasingly dependent on digital experiences, cannot operate in buffered mode any longer. We cannot pause while waiting for a round trip to a distant cloud data center to complete. Our lives demand instant experiences that constantly reflect our real-time interactions, unique profiles and dynamic behaviors.
Real life is stateful
In this complex digital web, users, devices and applications are constantly interacting, which generates deluges of data (i.e., “state”). That massive amount of data, either by itself or combined with other contextual data, creates the real-time “context” that is vital to delivering rich and meaningful digital experiences.
Edge computing can be easy when those interactions are stateless — as in, when the application doesn’t keep a record of previous user, application or device interactions. However, creating meaningful experiences requires up-to-date context, which in turn requires processing and making sense of the real-time trail of data, location and previous interactions.
That’s why IoT devices, mobile apps, gaming platforms, virtual and augmented reality (VR and AR), and other edge computing applications must have the ability to share this contextual and stateful information in a timely and coordinated manner. As such, those stateful edge applications will come to define the next generation of disruptive digital experiences.
With the emergence of 5G connectivity and edge computing, such as 5G Edge with AWS® Wavelength, we are much closer to delivering on the promise of instant, rich and meaningful digital experiences. But as enabling as these technologies can be, they are not a silver bullet — we still need to solve the problem of application state at the edge.
The ubiquitous stateless application model, where the almighty central database stores the complete view of the application data and state, would not work at the edge. While that pattern works well for web and cloud applications, the real world that is unfolding at the edge and in real time requires a different paradigm.
At the edge, experiences are driven by localized context and informed by our mobility. Data that is streaming from edge devices and users is perishable, dynamic and distributed across a large landscape of complex microdeployments. Each edge node provides a unique location-sensitive context that can illuminate the overall application experience. And it’s critical that each is maintained, distributed and made locally available to applications at the edge — consistently and in near real time.
Current state limitations and challenges
While architecting for distributed state/data is not a new concept, bringing that paradigm to the edge of the network surfaces a new set of application challenges. In the context of edge deployments, the ultimate goal for application developers is to fully optimize the whole application stack so as not to squander the latency improvements enabled by 5G and edge computing. At the end of the day, if it takes 100 ms to travel up the application stack through your load balancer, API gateway, microservices and database (the focus area for this blog), what is the point of running your application in a sub-20-ms AWS Wavelength Zone and connecting your end users and devices via ultrafast 5G?
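As a rough illustration, a simple per-hop latency budget shows where the edge advantage can disappear. Every number below is an assumption made for the sake of the example, not a measurement:

```python
# Hypothetical per-hop latency budget for one request served at the edge.
# All numbers are illustrative assumptions, not measurements.
latency_budget_ms = {
    "5G radio access (round trip)": 10,
    "load balancer": 2,
    "API gateway": 3,
    "microservice processing": 15,
    "database read/write": 70,  # a distant, centralized database dominates the budget
}

total = sum(latency_budget_ms.values())
print(f"End-to-end latency: {total} ms")  # 100 ms: the database hop erases the sub-20-ms gain
```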
Conventional databases are great at centralized coordination of application data, which makes sense when that data is being processed in (or close to) one place, such as a data center or the cloud. However, when the application is dependent on the processing of vast amounts of streaming data across multiple edge locations, centralized coordination becomes an issue, creating the very problem that edge computing is supposed to solve: latency.
Pushing application data store coordination to the edge (e.g., Wavelength Zones) does not necessarily solve the problem either, especially when strong data consistency or very close eventual consistency is required. Across a distributed edge application, processing on one edge node may go idle waiting on another edge node to successfully commit the application state. Rather than delivering the end-user experience, the application is forced to wait until the stateful data is processed and committed across the edge network. As a result, latency creeps back into the very architecture meant to overcome it.
To effectively combat those two challenges, what is needed are microdistributed data stores that orchestrate state at the edge at very low latency yet can scale across tens or hundreds of microedge locations while still providing a single, coherent view of application state and a “local” data access experience to developers.
Design considerations and solution options
When it comes to distributed edge applications, there are a few key design assertions worth keeping in mind.
- Most edge use cases (smart cities, venues and events, industrial factories, retail and shopping, AR and VR) are by nature localized experiences. That means they would only need to maintain stateful application data across a handful of compact microedge nodes, not a large swath of geodistributed sites. After all, a fleet of autonomous food-delivery bots in Chicago does not need to share real-time order delivery states with bots in California. Or do they?
- Workloads deployed at the edge will benefit from highly available, low-latency backbone networks between edge nodes (made available by telco providers like Verizon). That will enable distributed databases to work as effectively and reliably as they would within the confines of an enterprise data center or across hyperconnected AWS Availability Zones in the cloud.
Also, a critical piece of the distributed edge application solution stack is the database or data layer. To deliver stateful, highly resilient near real-time application performance, developers have to think through several key design considerations and solution options.
So how does one choose the right data store for a stateful edge application or microservice? To answer that, developers must factor several performance, resiliency, consistency and data model considerations into the decision process.
First, developers have to realize that when choosing a database or data store for their applications, they are not just choosing one piece of technology — they are choosing three: storage engine, data model and API/query language. For example, if you pick Postgres®, you are actually choosing the Postgres disk-based storage engine, a relational data model and the SQL query language. On the other hand, if you pick MongoDB®, you are choosing the MongoDB distributed storage engine (either disk based or in-memory based), a document-oriented (multimodel) data model and the MongoDB API.
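To make that three-in-one choice tangible, here is a minimal sketch of the same lookup expressed against each stack. The connection strings, database and collection names are placeholders rather than parts of any real deployment:

```python
# The same lookup against two different "three-in-one" bundles.
# Connection strings, database and collection names are placeholders.
import psycopg2                   # Postgres: disk-based storage engine, relational model, SQL
from pymongo import MongoClient   # MongoDB: distributed storage engine, document model, MongoDB API

# Relational data model + SQL query language
pg = psycopg2.connect("dbname=edge_app user=app host=localhost")
with pg.cursor() as cur:
    cur.execute("SELECT balance FROM accounts WHERE family_id = %s", ("family-42",))
    row = cur.fetchone()

# Document data model + MongoDB query API
mongo = MongoClient("mongodb://localhost:27017")
doc = mongo.edge_app.accounts.find_one({"family_id": "family-42"}, {"balance": 1})
```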
Also, while today’s traditional databases can scale to meet stateful data requirements within a data center or a few hyperconnected data centers (e.g., AWS Availability Zones), they can’t effectively scale out across large, dispersed microgeographic areas (i.e., edge nodes). Moreover, one key challenge with most of today’s distributed databases and data streaming systems is that, at their foundation, they rely on consensus protocols like Raft or Paxos, which are meant to run inside data centers with ultrareliable networks and very low latency. But as stated earlier in our design assertions, this should be addressed by hyperconnected edge nodes, making it less of a concern.
So to help you in your quest to pick the right distributed data store for your stateful edge application or microservice(s), here are some fine-grained design considerations and solution options to consider:
Data consistency requirements: Strong consistency vs. eventual consistency
Every stateful application, edge or not, has to deal with data consistency (strong vs. eventual) requirements — that’s nothing new. However, porting that very same challenge into the context of edge applications does complicate matters (especially when it comes to strongly consistent data) due to the increased complexity (high node distribution), dynamism (constant influx of context-dependent data streams) and real-time (low-latency) demands of the solution space.
However, not all edge application data needs to be strongly consistent. To reduce solution complexity and countless hours of development work, developers should start by mapping out their edge application’s consistency requirements data by data. Data that requires strong consistency is better centralized (e.g., in the cloud) if latency is not an issue. At the edge, eventually consistent data is easier to implement with minimal overhead because of its relaxed coordination and consensus (hence latency overhead) requirements.
For example, let’s say you are a start-up and you partner with the city of Chicago to drive citizen engagement through gamified real-time shopping experiences for families. Each family (several members) gets to share a single shopping points balance and can earn points by engaging in different quests and activities around the city. Your application will be deployed on multiple AWS Wavelength nodes in Chicago, and you need to replicate the points balance data in real time and provide each edge application deployment with a strongly consistent view of that data. In this case, a multimaster, strongly consistent data store is required at the edge, while the family members’ profiles can be centralized in the cloud and cached with eventual consistency at the edge.
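One way to express that split in code is to pin write and read concerns per collection. The sketch below uses a MongoDB replica set spanning the edge nodes as a stand-in for the strongly consistent store; the host, database and collection names are assumptions:

```python
# Minimal sketch: strongly consistent points balance vs. eventually consistent profile cache.
# Cluster topology, database and collection names are assumptions for illustration.
from pymongo import MongoClient
from pymongo.read_concern import ReadConcern
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://edge-node-a,edge-node-b,edge-node-c/?replicaSet=edge-rs")
db = client.city_quests

# Shared family points balance: acknowledged by a majority of replicas before returning
balances = db.get_collection(
    "balances",
    write_concern=WriteConcern(w="majority"),
    read_concern=ReadConcern("majority"),
)
balances.update_one({"family_id": "family-42"}, {"$inc": {"points": 50}}, upsert=True)

# Family profiles: mastered in the cloud, cached at the edge and read with relaxed guarantees
profiles = db.get_collection("profiles", read_concern=ReadConcern("local"))
profile = profiles.find_one({"family_id": "family-42"})
```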
Data store performance
When it comes to edge applications, it’s important to design every service or microservice to provide the best throughput and response latency. If one critical-path microservice becomes a bottleneck in the application flow, the overall user experience can suffer significantly, squandering the latency gains obtained from edge computing and 5G connectivity.
Having said that, not all edge application microservices are latency sensitive. Edge application developers need to define a data access latency model to map out different data access latency requirements for their application microservices: Which microservice requires which data to be stored, processed and distributed at the edge vs. in the cloud? Also, edge applications typically deal with large volumes of real-time data streams (think IoT sensors, video streams, lidar point clouds, etc.). Hence, they are usually very write intensive and require high-write performance data stores.
As an example, imagine you are building a food delivery robot application and you need to track real-time mechanical health telemetry across your entire fleet of delivery robots in each city. The high-resolution, high-volume real-time health telemetry data might require low latency at the edge in order to enable the real-time detection of robot mechanical anomalies through machine learning inference.
These kinds of performance requirements are best met by distributed in-memory data engines like Redis®, InfluxDB®, eXtremeDB® or other highly performant NoSQL databases such as MongoDB (with an in-memory storage engine like the open source Percona® Memory Engine); a brief ingestion sketch follows the list below. When provisioning in-memory data stores, there are some important compute infrastructure considerations to keep in mind:
- These types of databases are liable to be either compute bound or network I/O bound, so it is important to provision compute- and network-optimized instances and infrastructures to run them
- Some are more memory hungry than others and would greatly benefit from a large memory capacity. Redis, for some nontrivial data sets, can use a lot more RAM compared with MongoDB to store the same amount of data
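As a minimal sketch of the ingestion pattern described above, telemetry can be appended to a per-robot stream directly on the edge node. The host name, stream keys and field names are assumptions, and Redis stands in for whichever in-memory engine you choose:

```python
# Minimal sketch: write-intensive telemetry ingestion into an in-memory store at the edge.
# Host name, stream keys and field names are assumptions for illustration.
import redis

r = redis.Redis(host="edge-node-a", port=6379)

def ingest_telemetry(robot_id: str, sample: dict) -> None:
    """Append one telemetry sample to a per-robot Redis stream (an append-only, in-memory log)."""
    r.xadd(
        f"telemetry:{robot_id}",
        {k: str(v) for k, v in sample.items()},
        maxlen=100_000,       # cap memory per stream; older samples are trimmed
        approximate=True,
    )

ingest_telemetry("bot-17", {"motor_temp_c": 71.4, "vibration_g": 0.08, "battery_pct": 63})

# A local consumer can read the stream (XREAD/XREADGROUP) and run anomaly-detection
# inference at the edge without a round trip to the cloud.
```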
Data model and structure
Usually, edge applications have to deal with different types of data structures, e.g., key value, time series, document, blob, graph, relational, etc. So the question of which database or data store solution to pick might have multiple answers. However, introducing a polyglot data store architecture into your edge application will come with significant operational overhead and still not guarantee the desired polyglot data consistency across all your data stores. The alternative — and recommended — approach, in this case, is multimodel database systems (e.g., MongoDB, Redis) that support multiple data structure types. This allows a single database system to:
- Support multiple data models (required by different edge application microservices) against a single integrated back end
- Leverage a unified query language to retrieve multimodel data views such as JSON, graph, key value, etc.
- Guarantee multimodel data consistency — i.e., polyglot persistence
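As a brief sketch of what that looks like in practice, one back end can serve several shapes of data. The example below uses Redis core data structures; key names and payloads are assumptions:

```python
# Minimal sketch: several data models served by a single back end (Redis core structures).
# Key names and payloads are assumptions for illustration.
import redis

r = redis.Redis(host="edge-node-a", port=6379)

# Key value: session token lookups
r.set("session:abc123", "family-42", ex=3600)

# Document-like: an entity stored as a hash of fields
r.hset("robot:bot-17", mapping={"model": "v2", "city": "Chicago", "status": "active"})

# Time series-like: an ordered stream of events
r.xadd("events:bot-17", {"type": "delivery_complete", "order": "o-991"})

# One back end and one client for several data shapes; no cross-store consistency to reconcile.
```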
If, however, your application is mostly concerned with one type of data model or structure, let’s say streaming or JSON-like data formats, a lightweight distributed streaming database like eXtremeDB or InfluxDB might be a better option.
Data replication approach
Data replication is at the heart of making data durable, consistent and available for distributed edge applications. The basic idea behind edge data replication is simple: Keep multiple copies of data on physically isolated edge sites consistent (strong or eventual) and highly available, with no data loss. However, the implementation of distributed data replication is far from simple, and none of the approaches and techniques discussed here can be viewed as silver-bullet answers.
At the end of the day, all distributed replication protocols are constrained by the CAP theorem, which states that it is impossible for a distributed data store to simultaneously provide more than two of the following three guarantees: consistency, availability and partition tolerance (CAP).
When it comes to data replication, edge application developers have to evaluate distributed data stores with regard to how they handle two key replication domains: leader election (which defines resiliency and availability) and data replication (which defines consistency). Developers also have to determine which two of the three CAP guarantees they want to optimize for in their stateful edge applications.
Today, developers have two main replication design patterns to choose from: consensus-based and nonconsensus replication protocols.
If strong data consistency is desired, consider databases with a consensus-based replication protocol. Consensus protocols can be broadly classified into two categories: leader based and leaderless. Paxos and Raft are the two most commonly used leader-based consensus protocols, and most strongly consistent distributed databases have standardized on one of the two. A design best practice common in distributed databases leveraging Paxos or Raft is to apply the protocol at the level of an individual shard rather than to all of the data in the database. That means the leaders (of the various shards) are not all on a single server but are distributed across all of the servers, making these systems much more resilient to failures.
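A toy sketch (not any particular database's implementation) illustrates why per-shard consensus groups spread leadership across servers:

```python
# Illustrative sketch (not any particular database's implementation): applying consensus
# per shard gives each shard its own leader, and leaders spread across all servers.
import hashlib

SERVERS = ["edge-node-a", "edge-node-b", "edge-node-c"]
NUM_SHARDS = 12

# Each shard is its own Raft group; assume its elected leader may land on any server.
shard_leaders = {shard: SERVERS[shard % len(SERVERS)] for shard in range(NUM_SHARDS)}

def shard_for_key(key: str) -> int:
    """Route a key to a shard by hashing (a common, simple placement scheme)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_SHARDS

def leader_for_key(key: str) -> str:
    """A write for this key goes to the leader of its shard, not to one global leader."""
    return shard_leaders[shard_for_key(key)]

print(leader_for_key("family-42"))
```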
While Paxos has evolved into a family of protocols with various trade-offs, it remains complex to implement. Consequently, Raft has become the de facto standard for achieving consistency in modern distributed database systems. Next-generation distributed databases, such as YugabyteDB, CockroachDB® and TiDB, use Raft for both leader election and data replication. Popular databases such as MongoDB and InfluxDB use Raft partially: MongoDB’s leader election in a replica set became Raft based as of version 3.4, but data replication remains asynchronous, and InfluxDB uses Raft for high availability of its metadata nodes while using simple nonconsensus replication for the actual data nodes.
If strong data consistency is not required, consider databases with a nonconsensus replication protocol. A common alternative to Paxos and Raft is nonconsensus (i.e., peer-to-peer) replication protocols, such as those used by first-generation NoSQL databases like Amazon® DynamoDB®, Apache® Cassandra®, Couchbase® and InfluxDB. The main challenge with peer-to-peer replication is that it is prone to data loss upon failures — concurrent writes on the same record at two different replicas (e.g., edge nodes) are considered perfectly valid, and the final value has to be reconciled after the fact using strategies such as last writer wins (LWW) or conflict-free replicated data types (CRDTs).
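A simplified sketch of LWW reconciliation shows both how the conflict is resolved and why data can be lost. The replica names, timestamps and values are assumptions:

```python
# Simplified illustration of last-writer-wins (LWW) reconciliation.
# Replica names, timestamps and values are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class VersionedValue:
    value: str
    timestamp_ms: int  # wall-clock (or hybrid logical clock) timestamp attached to the write
    replica: str

def lww_merge(a: VersionedValue, b: VersionedValue) -> VersionedValue:
    """Keep the write with the latest timestamp; the other write is silently discarded."""
    return a if a.timestamp_ms >= b.timestamp_ms else b

# Two edge nodes accept concurrent writes to the same record
write_a = VersionedValue("status=delivered", timestamp_ms=1_700_000_000_123, replica="edge-node-a")
write_b = VersionedValue("status=failed", timestamp_ms=1_700_000_000_120, replica="edge-node-b")

winner = lww_merge(write_a, write_b)
print(winner.value)  # "status=delivered"; edge-node-b's write is lost, illustrating the risk
```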
Although the edge is very much here, and edge applications are proliferating, the real power and promise of the edge, IoT and 5G can only be unlocked with stateful edge applications — and as such, distributed edge databases are key to achieving that promise. Now go build them!