AI Gateway enhances caching for repetitive AI requests with a dual-engine approach. By combining a Redis-based precise cache and the DashVector semantic cache, it reduces costs and improves the efficiency of large language model (LLM) calls. This document covers the features, benefits, and configuration of both cache policies.
Semantic cache concepts
Vector embedding
-
The semantic cache uses vector technology to match user intent. When a user makes a new request, the system first converts the text into a high-dimensional vector through text embedding. This process is called vector embedding.
-
These vectors accurately capture the semantic features of the text. For example, although "Apple phone" and "iPhone" are different words, their vectors are very close in the vector space. Unlike a traditional precise cache, which requires an exact text match, this vectorized representation allows the system to understand that "tomorrow's weather in Beijing" and "Beijing's weather forecast for the next 24 hours" are semantically similar.
Vector comparison
-
After generating the vectors, the system uses the cosine similarity algorithm to calculate the angle between the new request's vector and the vectors of cached items. If the similarity meets a preset threshold, the system returns a cached response. A threshold value between 0.8 and 0.9 is recommended, but you can adjust it as needed.
-
This mechanism lets the system intelligently identify synonymous expressions. For example, in an e-commerce customer service scenario, a user might ask, "When will this package arrive?" or "What is the estimated delivery time?". The system can match both questions to the same cached answer: "According to the logistics information, your package will be delivered before 3:00 PM tomorrow."
Vector database
AI Gateway uses DashVector to manage vectors efficiently. This type of vector database uses a Hierarchical Navigable Small World (HNSW) algorithm to retrieve millions of vectors in milliseconds and supports dynamic updates. Unlike a traditional precise cache, a semantic cache supports matching an infinite number of semantic variants and improves resource utilization by sharing storage space for semantically similar items.
Strategy comparison
|
Feature |
Precise cache |
Semantic cache |
|
Matching method |
Exact string match. |
Vector space distance determined by cosine similarity. |
|
Fault tolerance |
Does not recognize synonymous or near-synonymous expressions. |
Supports fuzzy matching for synonyms, sentence variations, and more. |
|
Response speed |
Millisecond-level (local key-value query). |
Millisecond-level (nearest neighbor search in a vector database). |
|
Typical scenarios |
Standard FAQs and API calls with fixed parameters. |
Natural language instructions, multi-turn conversations, and fuzzy queries. |
|
Cost-effectiveness |
Ideal for high-frequency, repetitive requests. |
Ideal for scenarios with high semantic diversity and similar user intents. |
Benefits
-
Dual-mode cache system: Provides the flexibility to choose between two cache modes and dynamically adjust the matching threshold based on your business needs.
-
Precise cache: Based on the Redis key-value storage architecture, this mode provides millisecond-level responses to identical requests.
-
Semantic cache: Uses the DashVector vector database (part of Alibaba Cloud Vector Retrieval Service) to intelligently match semantically similar requests. The similarity threshold is adjustable, overcoming the limitations of traditional string matching.
-
-
Reduced redundant computations: For identical AI requests, the system returns cached data directly, avoiding redundant calls to the LLM.
-
Improved performance: By quickly retrieving results from the cache, AI Gateway significantly reduces response times and the load on backend servers. The semantic cache enables "intent-level" responses, greatly improving user satisfaction and experience.
-
Expanded scenario coverage: Supports standard scenarios like customer service systems and knowledge base queries, as well as variants of natural language instructions such as "tomorrow's weather" and "weather forecast for the next 24 hours".
-
Log monitoring: Allows you to analyze cache hit ratio metrics.
Procedure
Log on to the AI Gateway console and choose Instance. In the top menu bar, select a region, then click the target instance ID.
In the navigation pane on the left, choose Model API, then click the target API name to go to the API Details page.
-
Click Policies and Plug-ins, enable the Cache switch, and configure the parameters.
AI Gateway has upgraded its caching capabilities to support Semantic and Exact Match. Select the appropriate cache mode based on the Strategy comparison.
Semantic cache
ImportantIf a request includes the
x-higress-skip-ai-cache: onheader, the request bypasses the cache. It is forwarded directly to the backend service, and the response is not cached.-
Cache Key Strategy: Select the default option, Latest Query Only, or select Integrate Historical Queries based on your requirements.
-
Text Vectorization Configuration:
-
AI Service: Select an existing AI service. If you do not have an AI service, click Create Service to create one.
-
Model Name: Select the name of the model that you want to use.
-
Timeout Period: Set the timeout period. The default value is 5,000 ms.
-
-
Vector Database Configuration:
-
Service Provider: If you have not activated DashVector, click Go to Console to open the service activation page. You will need to . Record the collection name for later use.
ImportantWhen you create a collection, select
Cosineas the distance metric, and ensure the vector dimension matches that of the text embedding model. To find the output vector dimensions of text embedding models on the Model Studio platform, see Text and Multimodal Embeddings. -
Service URL: Enter your DashVector service endpoint.
-
Collection Name: Enter the name of the collection that you created.
-
API Key: The access credential. For more information, see Manage API keys.
-
Vector Similarity Threshold: This value determines how strictly queries are matched to cached content. It ranges from 0 to 1, with a recommended value of 0.8 or 0.9. A higher value indicates greater semantic similarity. For more information, see Vector similarity threshold configuration.
-
Timeout Period: Set the timeout period. The default value is 3,000 ms.
-
Precise cache
Important-
Go to the Redis console and add the VPC CIDR block of the gateway instance to the whitelist.
-
If a request includes the
x-higress-skip-ai-cache: onheader, the request bypasses the cache. It is forwarded directly to the backend service, and the response is not cached.
-
Cache Key Strategy: Select the default option, Latest Query Only, or select Integrate Historical Queries based on your requirements.
-
Redis Cache Configuration:
-
Redis service URL: Enter your Redis service URL.
-
Port Number: Enter your port number.
-
Access Method: Select the method for accessing the Redis service. The options are Account+Password, Password-only, and Password-free.
-
Database Account: If you select the Account+Password access method, enter the database account.
-
Database Password: If you select an access method that requires a password, enter the database password.
-
Database No.: The specified database number.
-
Duration (seconds): The default cache duration is 1,800 seconds. During this period, if the API receives an identical AI request, the LLM is not called, and the cached response is returned directly.
-
-
-
Confirm the configuration and click Save.
Vector similarity threshold configuration
Core concepts
The vector similarity threshold is a key parameter that controls the matching sensitivity of the semantic cache by determining how strictly queries are matched to cached content.
-
Value range: A value from 0.0 (completely dissimilar) to 1.0 (completely similar).
-
Recommended range: 0.8 to 0.9 (adjustable based on business requirements). A value below 0.8 is not recommended.
-
Lower threshold (e.g., 0.75): A cached result is returned as long as the semantics are similar, even if the user's phrasing is different.
-
Higher threshold (e.g., 0.99): A cached result is returned only when the user uses an almost identical expression.
Why is a value of 0.8 or higher recommended?
When the threshold is lower than 0.8, the system may misjudge irrelevant queries as matches. This can lead to "false positives" (incorrectly returning irrelevant results), negatively affecting the user experience or business accuracy.
Comparison example
|
Example configuration |
Similarity |
Query example |
Description |
|
1.0 |
"When will my package arrive?" |
This query is the benchmark for comparison. |
|
0.89 |
"What is the estimated delivery time for my package?" |
Matches "When will my package arrive?". A cache hit is recorded. |
|
|
0.86 |
"Can my package be delivered today?" |
Matches "When will my package arrive?". A cache hit is recorded. |
|
|
0.83 |
"Where can my package be delivered today?" |
Does not match. A cache miss is recorded. |
Tuning suggestions
-
Benchmark testing: Start with the default value of 0.8 and gradually adjust the threshold to observe changes in the cache hit ratio.
-
Scenario adaptation:
-
For time-sensitive queries, such as real-time logistics tracking, a higher similarity threshold is recommended to ensure accuracy.
-
For scenarios that require standardized answers, such as FAQ responses, a slightly lower similarity threshold can be used to capture more variations of a question.
-
-
Performance balance: Raising the threshold improves matching accuracy but reduces the cache hit ratio, thereby increasing the number of LLM calls.
FAQs
Q: How do I determine the optimal threshold?
A: Use A/B testing to compare:
-
The cache hit ratio versus the LLM call cost.
-
The rate of user complaints about irrelevant cached answers (for example, "Why did my new question return an old answer?").
-
The response time fluctuations for key queries (for example, real-time queries versus historical queries).
Re-evaluate the threshold settings periodically (for example, monthly) based on the latest business data. During peak hours, consider lowering the threshold to handle a surge in query volume.
Cache hit ratio
You must activate Simple Log Service before you can query the cache hit ratio. You can query only data generated after Simple Log Service is activated.
You can query the cache hit ratio at the gateway level using the following search statement:
cluster_id:{your-gatewayId} and inner-ai-cache-{your-gatewayId} | SELECT
SUM(CASE WHEN content LIKE '%cache hit for key%' OR content LIKE '%key accepted%' THEN 1 ELSE 0 END) AS hit_count,
SUM(CASE WHEN content LIKE '%cache miss for key%' OR content LIKE '%score not meet the threshold%' THEN 1 ELSE 0 END) AS miss_count,
SUM(CASE WHEN content LIKE '%cache hit for key%' OR content LIKE '%key accepted%' THEN 1 ELSE 0 END) * 100.0 /
NULLIF(SUM(CASE WHEN content LIKE '%cache hit for key%' OR content LIKE '%key accepted%' OR content LIKE '%cache miss for key%'
OR content LIKE '%score not meet the threshold%' THEN 1 ELSE 0 END), 0) AS hit_rate
-
Replace
{your-gatewayId}with your gateway instance ID. Note that you must retain thegw-prefix for the first replacement, but not for the second. If you access the plug-in log query system from the cache switch button, the query box automatically contains acluster_idfilter. In this case, you only need to paste the rest of the query statement after the filter. On the Monitoring (AI Cache) page, click View Logs to go to the log query page. The log source isapig-plugin-log. Enter the preceding query statement into the query bar and run it. -
The query returns the cache hit ratio.
For example, the result might show that
hit_count(cache hits) is 7,miss_count(cache misses) is 6, andhit_rate(hit ratio) is approximately 53.85%.
Examples
-
When precise cache is enabled, only identical queries are matched. For example, if a user sends the query "Who are you?" and it is not served from the cache, the model consumes tokens as usual (input tokens: 127, output tokens: 40). If the same query hits the precise cache (indicated by from-cache), both input and output tokens are 0, and the cached result is returned directly. This significantly reduces token consumption.
-
When semantic cache is enabled, semantically similar queries can also be matched. Queries with a semantic similarity below the threshold will not be matched.
When a semantic cache hit occurs, the cached response is reused for similar questions. Both input and output tokens are 0, consuming no model inference resources.
For example, if a user asks
Where can my package be delivered today?, and this query differs significantly in semantics from the cached questions, a cache miss occurs. Theqwen-maxmodel then generates a response and consumes tokens as usual.