This is a guest post by John Selvadurai, PhD, VP of Research & Development at Iterate.ai, in partnership with AWS.
Iterate.ai is an enterprise AI platform company delivering innovative AI solutions to industries such as retail, finance, healthcare, and quick-service restaurants. Focused on driving productivity and helping businesses achieve their goals, Iterate specializes in implementing cutting-edge AI technologies. Among its standout offerings is Frontline, a workforce management platform powered by AI, designed to support and empower Frontline workers. Available on both the Apple App Store and Google Play, Frontline uses advanced AI tools to streamline operational efficiency and enhance communication among dispersed workforces.
In this post, we give an overview of durable semantic caching in Amazon MemoryDB, and share how Iterate used this functionality to accelerate and cost-optimize Frontline.
What is Frontline?
Frontline is a mobile-first platform equipped with various workforce management tools aimed at optimizing productivity, building community, and supporting day-to-day operations for Frontline employees.
It includes features such as task management, real-time communication tools, and a social space for building community within the workforce. These features not only enhance the efficiency of Frontline workers but also contribute to a more cohesive and engaged workforce.
One of the platform’s key highlights is its AI-powered agent, which acts as a conversational assistant for Frontline workers.
This AI agent enables employees to effortlessly access critical information on topics like daily tasks, operation manuals, HR policies, and other essential information directly from their mobile devices.
The AI agent is able to do this by using a technique called Retrieval Augmented Generation (RAG). RAG allows you to store your critical information within a vector store to retrieve relevant context and deliver it to a large language model (LLM) to provide more context-specific answers to the user.
In Frontline, Iterate chose Amazon Bedrock for access to their LLM of choice and Amazon OpenSearch Service to store the vectors that represent the critical context needed to answer users’ questions.
The challenge of high latency and increasing cost
When Frontline was deployed, the Frontline workers were experiencing high levels of latency, 8–10 seconds in some cases. This was unacceptable, because response time is crucial in a fast-paced work environment where Frontline workers constantly manage tasks and interact with customers.
Initially, the AI agent’s responses exhibited a p50 latency of 4–5 seconds, which, though acceptable for most scenarios, presented a significant drawback in situations requiring quick interactions.
This delay impacted the overall user experience, particularly when workers needed fast, real-time responses while serving customers or executing time-sensitive tasks.
Beyond this, Iterate saw increasing user adoption of their Frontline solution and identified LLM inference fees as a bottleneck to further adoption. As they dove into the queries Frontline workers were asking, they noticed a high degree of redundancy, indicating they were unnecessarily invoking Amazon Bedrock and OpenSearch Service.
To address this, they needed a solution that would improve response speed without compromising the relevance and reliability of the information provided by the AI agent, while also helping to avoid invoking the LLM whenever possible.
Improving speed to single-digit milliseconds and saving money with a durable semantic cache in MemoryDB
MemoryDB is an ultra-fast, durable, in-memory database service compatible with Valkey and Redis OSS. It offers microsecond read latency and single-digit millisecond write latency, making it ideal for modern applications like those using microservices architectures. MemoryDB stores data in memory and uses a distributed transactional log for durability, allowing it to serve as a fully managed primary database without the need for separate caching or additional infrastructure management.
Durable semantic caching overview
In July 2024, MemoryDB introduced vector search capabilities, enabling durable semantic caching to help cost optimize and improve the performance of generative AI applications like Frontline. Semantic caching differs from traditional caching because it doesn’t rely on exact matches, but rather on the semantic similarity or the meaning of the text to retrieve results.
This feature uses the vector search capabilities of Amazon MemoryDB to allow you to store the vector representations of prior queries and their associated LLM-generated responses, such that when semantically similar questions are asked, the response comes from MemoryDB instead of generating the response from the LLM again.
To start, you create a vector index within your MemoryDB instance, specifying whether you want to store the vectors in JSON or HASH representations, the indexing algorithm (HNSW for approximate search, or FLAT for exact search with no index), the dimensionality of your associated embedding model, and your distance metric of choice (cosine similarity, dot product, or Euclidean). The following code snippet shows how a vector index can be created in an existing MemoryDB instance.
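This is a minimal sketch using the redis-py Python client; the cluster endpoint, index name, field names, and 1536-dimension embedding size are illustrative and should be adjusted to your own environment and embedding model:

```python
import redis

# Connect to the MemoryDB cluster endpoint (placeholder host shown here).
# For a multi-shard cluster, use redis.cluster.RedisCluster instead.
client = redis.Redis(
    host="my-memorydb-cluster.xxxxxx.memorydb.us-east-1.amazonaws.com",
    port=6379,
    ssl=True,
)

# Create an HNSW vector index over hash keys prefixed with "cache:".
# DIM must match the dimensionality of your embedding model, and the
# country TAG field enables the tag-based filtering described below.
client.execute_command(
    "FT.CREATE", "semantic_cache_idx",
    "ON", "HASH",
    "PREFIX", "1", "cache:",
    "SCHEMA",
    "question_vector", "VECTOR", "HNSW", "6",
        "TYPE", "FLOAT32",
        "DIM", "1536",
        "DISTANCE_METRIC", "COSINE",
    "country", "TAG",
)
```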
Once your index is created, you can start to load answers generated from the LLM, along with the vectors representing the questions and any associated numeric- or tag-based filters, to provide a level of personalization to the semantic cache hits you retrieve. In many cases, filtering matters because it ensures that the answers you retrieve from the semantic cache are only those relevant to the user asking the question. For example, a user asking a question from the United States may not want the same answer as someone asking a semantically similar question from another country.
In addition to this, you can customize the similarity threshold or radius for your query. This value is fixed at the time of the vector search and can range from 0 to 1, with a recommended value of 0.2, which you will need to tune to your application’s needs. A lower similarity threshold or radius means questions must be more semantically similar to match, so you will see fewer cache hits. This is a tradeoff you must make based on your application’s requirements.
In the following sample search query, you can see a country tag filter of ‘United States’ being applied along with a similarity threshold or radius of 0.2.
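A sketch of such a query, assuming the redis-py client and index from the earlier snippet and a hypothetical get_embedding helper that calls your embedding model (for example, through Amazon Bedrock), might look like the following:

```python
import struct

# Embed the incoming question with the same model used when writing to the cache.
# get_embedding is a placeholder for your embedding call (for example, Amazon Bedrock).
query_vec = get_embedding("How do I request time off?")
query_blob = struct.pack(f"{len(query_vec)}f", *query_vec)

# Vector range query: return cached entries whose question vectors fall within
# a radius of 0.2, restricted to entries tagged with the user's country.
# Note that the space in the tag value is escaped in the query string, and
# DIALECT 2 selects the query dialect used for vector search expressions.
results = client.execute_command(
    "FT.SEARCH", "semantic_cache_idx",
    "@country:{United\\ States} @question_vector:[VECTOR_RANGE 0.2 $query_vec]",
    "PARAMS", "2", "query_vec", query_blob,
    "RETURN", "1", "answer",
    "DIALECT", "2",
)
```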
They then added the question and answer for semantic caching:
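(As with the other snippets in this post, this is a sketch rather than Iterate’s exact code; the key naming, field names, placeholder helper functions, and one-day TTL are illustrative.)

```python
import struct

question = "How do I request time off?"
question_vec = get_embedding(question)       # same placeholder embedding helper as above
answer = generate_answer_with_rag(question)  # placeholder for the OpenSearch Service + Amazon Bedrock RAG call

# Store the LLM-generated answer in a hash under the indexed "cache:" prefix,
# along with the question's vector and a country tag for filtered lookups.
cache_key = "cache:req-time-off-001"
client.hset(cache_key, mapping={
    "question": question,
    "answer": answer,
    "question_vector": struct.pack(f"{len(question_vec)}f", *question_vec),
    "country": "United States",
})

# Optionally set a TTL so cached entries age out and stale answers are not served.
client.expire(cache_key, 86400)
```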
The following code illustrates reading from the semantic cache, which did not require filtering:
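(Again a sketch under the same assumptions; the result parsing follows the standard FT.SEARCH reply of a match count followed by keys and field-value pairs.)

```python
# Pure vector range query with no tag filter applied.
results = client.execute_command(
    "FT.SEARCH", "semantic_cache_idx",
    "@question_vector:[VECTOR_RANGE 0.2 $query_vec]",
    "PARAMS", "2", "query_vec", query_blob,
    "RETURN", "1", "answer",
    "DIALECT", "2",
)

# results[0] is the number of matches; on a cache hit, serve the stored answer
# instead of invoking OpenSearch Service and Amazon Bedrock again.
if results[0] > 0:
    cached_answer = results[2][1]   # value of the returned "answer" field
else:
    cached_answer = None            # cache miss: fall back to the RAG pipeline, then cache the result
```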
Improved efficiency, reliability, and cost-effectiveness of Frontline
By integrating MemoryDB, Iterate reduced the response latency to single-digit milliseconds for frequently asked questions, allowing the AI agent to deliver near real-time responses even in high-throughput scenarios. This improvement translated into tangible benefits for Frontline’s users:
- Enhanced user experience – With single-digit millisecond response times, Frontline workers can now access essential information on demand, allowing them to handle customer inquiries and complete tasks without delay.
- Cost optimization – Durable semantic caching has significantly decreased operational costs by reducing the load on Amazon Bedrock and OpenSearch Service for repeated queries, making the solution more cost-effective. Iterate estimates that 70% of requests are repeated questions that can be served from the MemoryDB caching layer, which translated into up to a 70% reduction in LLM call costs.
- Reliable performance at scale – MemoryDB is able to handle high request volumes, so Frontline’s AI agent remains stable and responsive, even during peak usage times.
The following video showcases the dramatic performance difference in Frontline’s AI-powered workforce management platform with and without durable semantic caching. On the right, you’ll see a Frontline worker using the app without MemoryDB’s semantic caching, experiencing response times of 4–5 seconds or even up to 8–10 seconds. On the left, you’ll see the same interaction with MemoryDB’s durable semantic caching enabled, delivering responses in single-digit milliseconds. This side-by-side comparison illustrates how Iterate.ai optimized their solution to provide near real-time responses for Frontline workers who need quick access to critical information while serving customers or executing time-sensitive tasks.
Conclusion
In this post, we discussed how Iterate used MemoryDB to accelerate and cost-optimize Frontline. They saw an improvement from multiple seconds to single-digit milliseconds and up to 70% cost savings with durable semantic caching in MemoryDB. We also walked through the capabilities MemoryDB provides to tune your semantic cache, such as numeric- and tag-based filters, custom similarity thresholds or radii, and TTLs.
Try out durable semantic caching with Amazon MemoryDB for your own use cases, and share your thoughts in the comments.
About the authors
John Selvadurai has earned a PhD and three master’s degrees (an MBA, an MS in Computer Science, and an MS in Network Technologies) in addition to gaining startup and enterprise experience as a technology strategist and architect. At Iterate, John has been instrumental in the build-out of Interplay, Iterate.AI’s patented low-code middleware platform, as well as Iterate’s business strategy capabilities. In addition to his enterprise experience, John also spent time in startups serving as the technical architect for a payment solutions provider.
Sanjit Misra is a Senior Technical Product Manager on the Amazon ElastiCache and Amazon MemoryDB team, focused on generative AI and machine learning, based in Seattle, WA. For over 15 years, he has worked in product and engineering roles related to data, analytics, and AI/ML.