DEVOPS & CLOUD
Is it latency? Or is it an outage?
How to reduce latency and minimize outages
By André Srinivasan, Global Solutions Architect, Alliances, Platforms, Edge Compute, IoT and Emerging Technologies, Redis
It’s not just latency that’s problematic.
If we define latency as the maximum delay between the time a client issues a request and the time the response is received, how do you know it’s latency and not an outage? Or abandonment, another form of outage?
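One way to make the question concrete is to classify each request from the client's point of view. The sketch below is illustrative only: the function name, the labels and the idea of a single fixed deadline are assumptions, not something prescribed by any particular system.

```python
import time
from typing import Callable


def classify_request(call: Callable[[], object], deadline_s: float) -> str:
    """Time one call and label the outcome as the client experiences it."""
    start = time.monotonic()
    try:
        call()
    except Exception:
        # Hard failure: the client received no usable response at all.
        return "outage"
    elapsed = time.monotonic() - start
    # A response that arrives after the client has stopped waiting
    # is, for that client, indistinguishable from no response.
    return "ok" if elapsed <= deadline_s else "effective outage"
```

With a deadline of 10 ms, a call that takes 50 ms is labeled "effective outage" even though the service technically answered.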
I’ll use the examples of real-time inventory and fraud detection to illustrate how you can reduce latency and, by extension, eliminate outages.
First, think about real-time inventory, specifically in the context of the pandemic. We’ve evolved our shopping habits during COVID-19; we think long and hard about whether we have to go into the store, and we make sure that what we need is actually in stock. Our plan is often to go to one store, get in, get our stuff and get out.
But if the store’s inventory system, or the public view of it, is slow, I may assume that what I need is available and go to the store. If it isn’t there, I’m going to leave. After all, I was there for a specific reason. I may also stop by customer service to complain — not just about the slow system, but also the lack of inventory. The result is lost revenue for the store, as well as lost time listening to complaints. In this case, latency in the inventory update is effectively an outage for that system.
Fraud detected too late
Now let’s move on to fraud detection. In a nutshell, fraud systems operate on relatively static data because they were developed before there was a need to incorporate real-time data into the risk calculation. If we think about a digital identity or user profile, verifying the customer’s identity draws on a mix of static data (e.g., mailing address) and dynamic data (e.g., recent purchases). Most likely, that static information isn’t so secret, due to data breaches. Bad actors know this and can develop strategies to defeat common fraud-detection techniques.
The fraud may eventually be detected, but often only after the transaction is already completed. In effect, that latency — the delay in incorporating real-time data — means the fraud-detection system was essentially unavailable. So, as a practical matter, it could be viewed as having suffered an outage.
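To illustrate why the dynamic data matters, here is a minimal sketch of a risk score that mixes one static signal with one dynamic signal. The field names, weights and the "three times the typical purchase" rule are hypothetical and chosen only for illustration; a real fraud model would be far richer.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Profile:
    mailing_address: str                                   # static data
    recent_purchases: List[float] = field(default_factory=list)  # dynamic data


def risk_score(profile: Profile, shipping_address: str, amount: float) -> float:
    """Return a score between 0 and 1; higher means more suspicious."""
    score = 0.0
    # Static check: passes trivially if the attacker knows the breached address.
    if shipping_address != profile.mailing_address:
        score += 0.4
    # Dynamic check: compare against recent behavior, which is harder to fake.
    if profile.recent_purchases:
        typical = sum(profile.recent_purchases) / len(profile.recent_purchases)
        if amount > 3 * typical:
            score += 0.5
    else:
        score += 0.2  # no history available, so slightly elevated risk
    return score


customer = Profile("1 Main St", [20.0, 30.0, 25.0])
static_only_pass = risk_score(customer, "1 Main St", 500.0)
```

In this example the address matches, so a purely static check would see nothing wrong; only the comparison against recent purchases raises the score. That comparison is only useful if the recent-purchase data is fresh at the moment the transaction is scored.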
At this point, it would be perfectly relevant for you to say, “Wait a minute, we keep modernizing our architectures [as Figure 1 below indicates]. How can this still be an issue?”
I would suggest that as we’ve evolved architectures, we’ve solved for availability and delivering on an SLA of five nines. We started out with relational databases, then broke up the data into multiple relational databases for performance, and then introduced data-type-specific stores to further improve performance.
But multiple data stores also introduce multiple copies of data and create consistency challenges. We were able to address this with event-driven architectures using a message bus such as Apache® Kafka®. We therefore accomplished our five nines of availability, and we’re mitigating the challenge of data consistency across all the systems at the cost of complexity. But we haven’t addressed latency overall. Meanwhile, consumers are more frequently using mobile devices to better inform themselves — all of which points to a need for speed more than ever before.
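For scale, the arithmetic behind an availability SLA can be sketched as follows. The function name and the 365-day year are assumptions made for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime permitted by an availability of N nines (e.g., 5 -> 99.999%)."""
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability


five_nines = round(allowed_downtime_minutes(5), 3)   # about 5.256 minutes per year
three_nines = round(allowed_downtime_minutes(3), 1)  # about 525.6 minutes per year
```

Note what this calculation leaves out: a request that is answered too slowly to be useful still counts as "up" here, which is why meeting five nines does not by itself address latency.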
Latency in the data store
If we turn our focus in this event-driven microservices architecture to the data store, I would suggest there are many tools to solve the latency problem. As a solutions architect at Redis, I can speak to the tools I use that operate at submillisecond latency in this type of architecture. Redis Enterprise is an in-memory database platform that maintains the high performance of open source Redis® and adds enterprise-grade capabilities for companies running their business in the cloud, on-premises and with hybrid models. The blog post “Redis Enterprise Extends Linear Scalability with 200M ops/sec @ <1ms Latency on Only 40 AWS Instances” provides further details about how Redis achieves real-time performance with linear scaling to hundreds of millions of operations per second.
For instance, we can reduce the overall data store complexity if we take advantage of Redis Enterprise in our event-driven microservices architecture (see Figure 2).
Now, with a highly performant system, we can address our need for real-time data by incorporating dynamic information into the risk calculation while still completing this step in line with the transaction. Similarly, we can return to the is-what-I-need-at-the-store problem with real-time inventory and, assuming there is only one store or everything is centralized, we can be confident that latency will not create false results.
Complexity of multiple systems
Which takes us to the next level: What happens when there is more than one store and there is no centralization, but instead multiple inventory systems? If we do nothing, we have a new level of consistency challenge. Fortunately, we can address this with active-active replication and create master data across the entire set of systems. Active-active, implemented as a conflict-free replicated database, is illustrated in Figure 3. It refers to using at least two data centers, each of which can service an application at any time, so that the application stays accessible even if parts of the network or servers fail unexpectedly. See Active-Active Geo-Distribution (CRDTs-Based) for a deeper dive.
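To give a feel for how conflict-free replication can keep inventory counts consistent, here is a minimal sketch of a textbook CRDT, the PN-counter. It is a generic illustration and not a description of how Redis Enterprise implements active-active internally; the class name, method names and replica IDs are all hypothetical.

```python
from typing import Dict


class PNCounter:
    """Counter CRDT: each replica tracks its own adds and removes separately."""

    def __init__(self, replica_id: str) -> None:
        self.replica_id = replica_id
        self.added: Dict[str, int] = {}
        self.removed: Dict[str, int] = {}

    def add(self, n: int = 1) -> None:
        self.added[self.replica_id] = self.added.get(self.replica_id, 0) + n

    def remove(self, n: int = 1) -> None:
        self.removed[self.replica_id] = self.removed.get(self.replica_id, 0) + n

    def value(self) -> int:
        return sum(self.added.values()) - sum(self.removed.values())

    def merge(self, other: "PNCounter") -> None:
        # Taking the per-replica maximum makes merging commutative,
        # associative and idempotent, so replicas converge in any order.
        for rid, n in other.added.items():
            self.added[rid] = max(self.added.get(rid, 0), n)
        for rid, n in other.removed.items():
            self.removed[rid] = max(self.removed.get(rid, 0), n)


# Two stores update the same item's stock independently, then sync.
east, west = PNCounter("east"), PNCounter("west")
east.add(10)      # east receives a shipment of 10
west.remove(3)    # west sells 3 units at the same time
east.merge(west)
west.merge(east)
```

After the two merges, both replicas report the same stock level without either side having had to wait for the other before accepting its write.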
Unfortunately, if my applications are not colocated with the data, I’m still facing an average of 100 ms of latency between where the data is needed and where the data lives; latency can still be the source of an outage.
We can solve this challenge by bringing the data closer to where it is being used. If the public cloud is too far away at 100 ms, then I can leverage network edge services such as Verizon 5G Edge with AWS Wavelength to reduce the logical distance between the application and the data and thereby decrease the latency to around 25 to 30 ms with 5G Ultra Wideband.
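As a rough illustration of what that difference buys, the sketch below counts how many sequential round trips to the data fit inside a fixed response-time budget. The 200 ms budget and the function name are assumptions chosen for the example; the 100 ms, 30 ms and 25 ms figures are the ones quoted above.

```python
def round_trips_within(budget_ms: float, round_trip_ms: float) -> int:
    """Number of whole sequential round trips that fit in the budget."""
    return int(budget_ms // round_trip_ms)


BUDGET_MS = 200  # hypothetical end-to-end budget for one user interaction

from_cloud = round_trips_within(BUDGET_MS, 100)    # 2 round trips
from_edge_slow = round_trips_within(BUDGET_MS, 30) # 6 round trips
from_edge_fast = round_trips_within(BUDGET_MS, 25) # 8 round trips
```

Under these assumptions, moving the data to the network edge leaves room for three to four times as many data lookups before the user notices a delay.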
Our distributed system is now modernized with an event-based microservices architecture, where we have reduced the data store complexity, created data consistency with active-active replication and reduced latency to meet real-time requirements. Furthermore, I can bring that architecture closer to where it’s being consumed and reduce the impact of latency as a source of an outage. By combining network edges with active-active replication, I can ensure that wherever the data lives, it is consistent and close to where it is being used. Latency, along with the risk, perception or reality of an outage, is minimized.
Interested in learning about more modern approaches to reducing latency in complex distributed systems? Read our e-book, Latency is the New Outage — because you don’t have the luxury of time.
The author is solely responsible for the content. Its inclusion does not imply endorsement by Verizon of the content, the third party or its products or services.