The throttling policy controls traffic to LLM APIs based on token consumption, request count, and concurrency. Configure rules by consumer, request header, query parameter, cookie, client IP, or model name. A global limit provides API-wide fallback protection against overload and abuse.
Features
Prevent resource overload: Limit requests by consumer, header, query parameter, cookie, client IP, or model name to prevent overloads. Combine with a caching policy for better performance.
Dynamic traffic control: Apply throttling per second, minute, hour, or day to maintain stability under high concurrency.
Multiple matching rules: Supports exact match, prefix match, regex match, and any match for flexible rule targeting.
Multiple throttling modes: Throttle by token consumption, request count, or concurrency.
Fine-grained model control: Set different thresholds per model name to protect high-cost resources.
Global limits: Apply an API-wide fallback limit on total token consumption, request count, and concurrency.
Prevent malicious attacks: Throttle specific consumers, headers, query parameters, cookies, or client IPs to block crawlers and automated abuse.
Use cases
High-concurrency scenarios: During e-commerce promotions, limit token consumption per time period to prevent malicious high-frequency calls and ensure stability and fairness.
AI service calls: Throttle calls to your LLM APIs to prevent service degradation or system crashes caused by traffic bursts.
Multi-tenant systems: Assign separate quotas per tenant to ensure fairness and resource isolation.
Fine-grained model control: Set per-model limits (for example, different quotas for qwen-max and qwen-plus) to protect high-cost resources.
Global traffic protection: Cap total token consumption, request count, and concurrency across the entire API.
Malicious attack protection: Defend against crawlers, DDoS attacks, and API abuse.
Procedure
Go to the Instances page in the AI Gateway Console. In the top menu bar, select the region where your target instance is located, and then click the target instance ID.
In the left-side navigation pane, click Model API, and then click the target API Name to go to the API Details page.
Click Policies and Plug-ins, then turn on the Throttling switch and configure the related parameters.
Note A request can match a maximum of 10 rules simultaneously.
Throttling policy
Each rule consists of a Condition, Throttling rule, Time window, Limit, and Limit unit. Drag rules to reorder them. Click Add to create a new rule.
Parameter | Description |
Throttling | Turns the throttling policy on or off. Off by default. |
Condition | Dimension for throttling. Options: By consumer, By request header, By request query, By request cookie, By client IP, and By model. |
Throttling rules | Depending on the selected condition, enter the matching rule, parameter name, and match content. Each dimension is described below. |
Time window | Time window for throttling: Every second, Every minute, Every hour, or Every day. |
Limit | Throttling threshold. |
Limit unit | Unit of measurement: token, request count, or concurrency. |
By consumer
Throttles requests based on consumer identity. Ideal for multi-tenant systems.
Form items: Condition (By consumer) → Matching rule (exact match/prefix match/regex match/any match) → Consumer selection → Time window → Limit + Limit unit
Consumer selection: Select from existing consumers or click Create Consumer to add one. If you select any match, no specific consumer is required.
Example: Limit any consumer to 1,000 tokens per minute.
Important To configure throttling by consumer, you must first enable consumer authentication.
By request header
Throttles requests based on a specified field in the request header.
Form items: Condition (By request header) → Parameter name (Header field name) → Matching rule (exact match/prefix match/regex match/any match) → Match content → Time window → Limit + Limit unit
Parameter name: Required. Enter the header field name to match.
Match content: This field is not required if you select any match.
Example: For requests where the header x-user-level has a value of beta, set a limit of 100 tokens per minute.
By request query
Throttles requests based on a query parameter in the request URL.
Form items: Condition (By request query) → Parameter name (Query parameter name) → Matching rule (exact match/prefix match/regex match/any match) → Match content → Time window → Limit + Limit unit
Parameter name: Required. Enter the query parameter name to match.
Match content: This field is not required if you select any match.
Example: For requests with the query parameter user_id=1, set a limit of 100 tokens per minute.
By request cookie
Throttles requests based on a specified field in the request cookie.
Form items: Condition (By request cookie) → Parameter name (Cookie field name) → Matching rule (exact match/prefix match/regex match/any match) → Match content → Time window → Limit + Limit unit
Parameter name: Required. Enter the cookie field name to match.
Match content: This field is not required if you select any match.
Example: For requests containing a cookie with a specific identifier, set a limit of 100 tokens per minute.
By client IP
Throttles requests based on client IP. Supports both individual IPs and IP ranges.
Form items: Condition (By client IP) → IP address → Limit + Limit unit
IP address: Enter an IP address (for example, 192.168.1.1) or CIDR range (for example, 192.168.1.0/24). To match all client IPs, enter 0.0.0.0/0.
Example: Set the maximum concurrency to 50 for each client IP.
Note Throttling by client IP has a simplified configuration and does not require selecting a matching rule or time window.
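The IP and CIDR matching described above can be illustrated with a minimal Python sketch. This is not the gateway's implementation; the function name ip_matches is hypothetical, and the sketch only shows how a single address or a CIDR range such as 192.168.1.0/24 or 0.0.0.0/0 would select client IPs.

```python
import ipaddress


def ip_matches(client_ip: str, rule_value: str) -> bool:
    """Return True if client_ip falls inside rule_value (a single IP or a CIDR range)."""
    # strict=False tolerates host bits in the value; a bare IP parses as a /32 network.
    network = ipaddress.ip_network(rule_value, strict=False)
    return ipaddress.ip_address(client_ip) in network


print(ip_matches("192.168.1.42", "192.168.1.0/24"))  # True
print(ip_matches("10.0.0.5", "192.168.1.0/24"))      # False
print(ip_matches("10.0.0.5", "0.0.0.0/0"))           # True
print(ip_matches("192.168.1.1", "192.168.1.1"))      # True
```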
By model
Sets separate throttling thresholds for specific model names. Suitable for services that use multiple models.
Form items: Condition (By model) → Matching rule (fixed to exact match) → Model name → Limit + Limit unit
Model name: Required. Enter the name of the target model to throttle.
Limit unit: Supports token, request count, and concurrency.
Example: For the qwen-max model, set a limit of 500 tokens per minute and a maximum concurrency of 10.
Note Throttling by model always uses an exact match. If you need more flexible model matching, such as prefix or regex matching, use the "By request header" condition and manually specify the parameter name as x-higress-llm-model.
Global limit
The global limit is a fallback policy that applies to the entire API, independent of condition-based rules.
How to enable: In the Global limit section, select the Enable checkbox.
Form items: Time window (Every second/Every minute/Every hour/Every day) → Limit + Limit unit (token/request count/concurrency)
Add multiple rules: Click Add to create multiple global limit rules with different throttling modes.
Example: For the entire API, set a maximum of 10,000 tokens per minute, 100 requests per minute, and a maximum concurrency of 20.
Note The global limit is a fallback policy that supplements the standard rules and can be used together with them. A request is throttled if it exceeds any limit that applies to it, whether that limit is global or rule-based.
Confirm your configurations and click Save.
Matching rules
The conditions By consumer, By request header, By request query, and By request cookie support the following four matching rules. Priority: exact match > prefix match > regex match > any match.
Matching rule | Description | Example |
Exact match | The value must be an exact match. | The header |
Prefix match | The value must start with the specified prefix. | The header |
Regex match | The value must match the specified regular expression. | The header |
Any match | Matches any value for the given dimension. No match content is required. | Applies to any consumer. |
Note If you configure multiple rules, a request is throttled when it exceeds the limit of any rule it matches. The By client IP and By model conditions use fixed matching methods and do not require manual rule selection.
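The priority order can be sketched in Python as follows. This is an illustration under assumptions, not the gateway's code: it assumes the priority picks one rule among rules that share the same condition and parameter name, and that a regex matches when it is found anywhere in the value. The Rule class, the best_match function, and the sample rules are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Lower number = higher priority: exact > prefix > regex > any.
PRIORITY = {"exact": 0, "prefix": 1, "regex": 2, "any": 3}


@dataclass
class Rule:
    match_type: str         # "exact", "prefix", "regex", or "any"
    content: Optional[str]  # match content; None for "any"
    limit: int


def matches(rule: Rule, value: str) -> bool:
    if rule.match_type == "exact":
        return value == rule.content
    if rule.match_type == "prefix":
        return value.startswith(rule.content)
    if rule.match_type == "regex":
        # Assumption: a partial (search) match counts as a match.
        return re.search(rule.content, value) is not None
    return rule.match_type == "any"


def best_match(rules: List[Rule], value: str) -> Optional[Rule]:
    """Return the highest-priority rule that matches value, or None."""
    candidates = [r for r in rules if matches(r, value)]
    return min(candidates, key=lambda r: PRIORITY[r.match_type], default=None)


# Hypothetical rules for one header, listed in arbitrary order.
rules = [Rule("any", None, 50), Rule("prefix", "vip", 5000), Rule("exact", "beta", 100)]
print(best_match(rules, "beta").limit)      # 100
print(best_match(rules, "vip-gold").limit)  # 5000
print(best_match(rules, "guest").limit)     # 50
```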
Limit units and modes
For each throttling rule, select a limit unit that corresponds to a different metering method:
Limit unit | Description | Applicable conditions |
Token | Measures the total inbound and outbound tokens consumed. | All conditions |
Request count | Measures the total number of requests. | All conditions |
Concurrency | Measures the number of simultaneous requests. | All conditions |
The throttling policy supports the following time windows, used in combination with a limit unit:
Time window | Token throttling | Request count throttling | Concurrency throttling |
Every second | Maximum tokens per second. | Maximum requests per second. | — |
Every minute | Maximum tokens per minute. | Maximum requests per minute. | — |
Every hour | Maximum tokens per hour. | Maximum requests per hour. | — |
Every day | Maximum tokens per day. | Maximum requests per day. | — |
(No time window) | — | — | Maximum simultaneous requests. |
Time-window throttling counts against a fixed window that is implemented with a Time to Live (TTL). Each combination of a throttling dimension and a matching value has its own independent throttling key. The timer for a window starts when the first valid data is recorded for its key. The window is not aligned to natural clock boundaries, such as 1:00 to 2:00, and it is not a sliding window. Subsequent requests do not refresh the expiration time of the current window, and the window resets automatically when it expires. For example, if the first data for a key with an hourly limit is recorded at 01:23:10, its window resets at approximately 02:23:10, not at 02:00:00.
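The window behavior described above can be sketched in Python. This is a simplified single-process illustration, not the gateway's implementation: the class and method names are hypothetical, time is passed in as seconds since midnight so the example is deterministic, and a request is rejected when its cost would push the counter past the limit.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Window:
    expires_at: float  # fixed when the window is created, never refreshed
    used: int


class FixedWindowLimiter:
    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window_seconds = window_seconds
        self.windows: Dict[str, Window] = {}

    def allow(self, key: str, cost: int, now: float) -> bool:
        window = self.windows.get(key)
        if window is None or now >= window.expires_at:
            # First data for this key, or the old window expired: start a new window now.
            window = Window(expires_at=now + self.window_seconds, used=0)
            self.windows[key] = window
        if window.used + cost > self.limit:
            return False
        window.used += cost  # counting does not extend expires_at
        return True


def hms(h: int, m: int, s: int) -> int:
    return h * 3600 + m * 60 + s


limiter = FixedWindowLimiter(limit=2, window_seconds=3600)  # 2 requests per hour
print(limiter.allow("consumer-001", 1, hms(1, 23, 10)))  # True, window starts at 01:23:10
print(limiter.allow("consumer-001", 1, hms(1, 40, 0)))   # True
print(limiter.allow("consumer-001", 1, hms(2, 0, 0)))    # False, no reset at 02:00:00
print(limiter.allow("consumer-001", 1, hms(2, 23, 10)))  # True, window reset at 02:23:10
```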
Note Concurrency throttling does not require a time window. You can set the maximum concurrency directly.
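Because concurrency throttling tracks requests that are in flight rather than requests per window, it can be pictured as a counter that rises when a request starts and falls when it finishes. The following Python sketch is an illustration with hypothetical names, limited to a single process, and does not reflect how the gateway stores its counters.

```python
import threading
from collections import defaultdict


class ConcurrencyLimiter:
    def __init__(self, max_concurrency: int):
        self.max_concurrency = max_concurrency
        self.in_flight = defaultdict(int)  # key -> number of requests in progress
        self._lock = threading.Lock()

    def acquire(self, key: str) -> bool:
        """Call when a request starts; returns False if the request must be throttled."""
        with self._lock:
            if self.in_flight[key] >= self.max_concurrency:
                return False
            self.in_flight[key] += 1
            return True

    def release(self, key: str) -> None:
        """Call when a request finishes, whether it succeeded or failed."""
        with self._lock:
            if self.in_flight[key] > 0:
                self.in_flight[key] -= 1


limiter = ConcurrencyLimiter(max_concurrency=2)
print(limiter.acquire("qwen-max"))  # True
print(limiter.acquire("qwen-max"))  # True
print(limiter.acquire("qwen-max"))  # False, two requests are already in flight
limiter.release("qwen-max")
print(limiter.acquire("qwen-max"))  # True, a slot was freed
```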
Configuration examples
Example 1: Throttle by consumer token and IP concurrency
Configure two rules: limit any consumer to 1,000 tokens per minute and set the maximum concurrency for each client IP to 50.
No. | Condition | Matching rule | Parameter name/match content | Time window | Limit | Limit unit |
1 | By consumer | any match | — | Every minute | 1000 | token |
2 | By client IP | — | 0.0.0.0/0 | — | 50 | concurrency |
Example 2: Throttle by request header
Set different throttling rules for different header values:
No. | Condition | Parameter name | Matching rule | Match content | Time window | Limit | Limit unit |
1 | By request header | x-user-level | exact match | beta | Every minute | 100 | token |
2 | By request header | x-user-level | prefix match | vip | Every hour | 5000 | token |
3 | By request header | x-app-id | any match | — | Every minute | 50 | request count |
Example 3: Limit by model name
Set different limits for different models: limit qwen-max to 500 tokens per minute and a maximum concurrency of 10, and limit qwen-plus to 2,000 tokens per minute.
No. | Condition | Matching rule | Model name | Time window | Limit | Limit unit |
1 | By model | exact match | qwen-max | Every minute | 500 | token |
2 | By model | exact match | qwen-plus | Every minute | 2000 | token |
3 | By model | exact match | qwen-max | — | 10 | concurrency |
Example 4: Combine global and consumer limits
In addition to the consumer-based limit, enable a global limit as a fallback policy:
Throttling policy (Standard rule):
No. | Condition | Matching rule | Time window | Limit | Limit unit |
1 | By consumer | any match | Every minute | 1000 | token |
Global limit (Enable checkbox selected):
No. | Time window | Limit | Limit unit |
1 | Every minute | 10000 | token |
2 | Every minute | 100 | request count |
3 | — | 20 | concurrency |
Note The global limit is a fallback policy that supplements the standard rules. It must be enabled and configured separately.
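The interaction in Example 4 can be sketched in Python. The thresholds come from the tables above; everything else is an assumption made for illustration, including the Limit class, the exceeded function, the usage counters, and the choice to treat a limit as hit once its counter reaches the threshold.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Limit:
    scope: str      # "consumer" (standard rule) or "global"
    unit: str       # "token", "request", or "concurrency"
    threshold: int


LIMITS = [
    Limit("consumer", "token", 1000),
    Limit("global", "token", 10000),
    Limit("global", "request", 100),
    Limit("global", "concurrency", 20),
]


def exceeded(limits: List[Limit], usage: Dict[Tuple[str, str], int]) -> List[Limit]:
    """Return every limit whose current counter has reached its threshold."""
    return [l for l in limits if usage.get((l.scope, l.unit), 0) >= l.threshold]


# Hypothetical counters: this consumer is under its own quota,
# but the API as a whole has used up its per-minute token budget.
usage = {
    ("consumer", "token"): 400,
    ("global", "token"): 10000,
    ("global", "request"): 37,
    ("global", "concurrency"): 5,
}
hit = exceeded(LIMITS, usage)
print([f"{l.scope}/{l.unit}" for l in hit])  # ['global/token']
print(bool(hit))                             # True, the request is throttled
```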
Example 5: Mixed multi-dimensional rules
Configure throttling rules based on a variety of conditions simultaneously:
No. | Condition | Parameter name | Matching rule | Match content/IP/model name | Time window | Limit | Limit unit |
1 | By consumer | — | exact match | consumer-001 | Every minute | 500 | token |
2 | By request header | x-user-level | prefix match | vip | Every hour | 10000 | token |
3 | By request query | user_id | regex match | ^[0-9]+$ | Every minute | 200 | request count |
4 | By request cookie | session | any match | — | Every minute | 100 | request count |
5 | By client IP | — | — | 192.168.1.0/24 | — | 30 | concurrency |
6 | By model | — | exact match | qwen-max | Every minute | 1000 | token |
FAQ
Q: What is the maximum number of throttling rules I can configure?
A: A request can match up to 10 rules simultaneously. Combine rules across dimensions as needed, but fewer rules yield better performance.
Q: How do multiple rules interact with each other?
A: Rules use OR logic: a request is throttled when it exceeds the limit of any rule it matches. Rules with the same condition and match key merge into a single rule group.
Q: Can I use a global limit and standard throttling rules at the same time?
A: Yes. The global limit is a fallback that applies to the entire API without distinguishing keys, while standard rules provide fine-grained control per dimension. Throttling triggers if any limit is exceeded.
Q: What is the difference between throttling by model and by request header?
A: Throttling by model automatically converts to an exact match on the x-higress-llm-model header. For prefix or regex matching, use "By request header" with parameter name x-higress-llm-model.
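As an illustration of the header-based alternative, the following Python sketch tests a regex against the x-higress-llm-model header. The pattern, the function name, and the sample header values are hypothetical, and the sketch assumes a regex match means the pattern is found in the header value.

```python
import re

MODEL_HEADER = "x-higress-llm-model"        # header name taken from the answer above
PATTERN = re.compile(r"^qwen-(max|plus)$")  # hypothetical regex match content


def model_header_matches(headers: dict) -> bool:
    """Return True if the model header value matches the regex rule."""
    value = headers.get(MODEL_HEADER, "")
    return PATTERN.search(value) is not None


print(model_header_matches({MODEL_HEADER: "qwen-max"}))     # True
print(model_header_matches({MODEL_HEADER: "qwen-plus"}))    # True
print(model_header_matches({MODEL_HEADER: "other-model"}))  # False
print(model_header_matches({}))                             # False
```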
Q: Can I combine the token, request count, and concurrency limit units?
A: Yes. Add multiple rules for the same condition with different limit units (for example, a per-minute token limit and a max concurrency limit for one model). They count independently, and throttling triggers if any limit is reached.
Q: Why are there no options for matching rule and time window when throttling by client IP?
A: Client IP throttling uses a simplified form: enter an IP address or CIDR range, and the system handles matching automatically. To apply a limit to all client IPs, enter 0.0.0.0/0.
Q: Does the order of the rules affect throttling behavior?
A: Rules support drag-and-drop sorting. However, because rules use OR logic (a request is throttled when it exceeds the limit of any rule it matches), order does not affect throttling behavior.
Q: How long does it take for updated throttling configurations to take effect?
A: Changes typically take effect within seconds after the system pushes the new rules to the gateway data plane.
Q: How accurate is throttling in a distributed architecture?
A: In distributed systems, throttling counts may have slight deviations. The actual allowed requests may differ from the configured limit depending on request volume, rate, and backend latency.