Throttling

The throttling policy controls traffic to LLM APIs based on token consumption, request count, and concurrency. Configure rules by consumer, request header, query parameter, cookie, client IP, or model name. A global limit provides API-wide fallback protection against overload and abuse.

Features

Prevent resource overload: Limit requests by consumer, header, query parameter, cookie, client IP, or model name. Combine with a caching policy for better performance.
Dynamic traffic control: Apply throttling per second, minute, hour, or day to maintain stability under high concurrency.
Multiple matching rules: Supports exact match, prefix match, regex match, and any match for flexible rule targeting.
Multiple throttling modes: Throttle by token consumption, request count, or concurrency.
Fine-grained model control: Set different thresholds per model name to protect high-cost resources.
Global limits: Apply an API-wide fallback limit on total token consumption, request count, and concurrency.
Prevent malicious attacks: Throttle specific consumers, headers, query parameters, cookies, or client IPs to block crawlers and automated abuse.

Use cases

High-concurrency scenarios: During e-commerce promotions, limit token consumption per time period to prevent malicious high-frequency calls and ensure stability and fairness.
AI service calls: Throttle calls to your LLM APIs to prevent service degradation or system crashes caused by traffic bursts.
Multi-tenant systems: Assign separate quotas per tenant to ensure fairness and resource isolation.
Fine-grained model control: Set per-model limits (for example, different quotas for qwen-max and qwen-plus) to protect high-cost resources.
Global traffic protection: Cap total token consumption, request count, and concurrency across the entire API.
Malicious attack protection: Defend against crawlers, DDoS attacks, and API abuse.

Procedure

Go to the Instances page in the AI Gateway Console. In the top menu bar, select the region where your target instance is located, and then click the target instance ID.
In the left-side navigation pane, click Model API, and then click the target API Name to go to the API Details page.
Click Policies and Plug-ins, then turn on the Throttling switch and configure the related parameters.

Note A request can match a maximum of 10 rules simultaneously.

Throttling policy

Each rule consists of a Condition, Throttling rule, Time window, Limit, and Limit unit. Drag rules to reorder them. Click Add to create a new rule.

Parameter	Description
Throttling	Turns the throttling policy on or off. Off by default.
Condition	Dimension for throttling. Options: By consumer, By request header, By request query, By request cookie, By client IP, and By model.
Throttling rules	Depending on the selected condition, enter the matching rule, parameter name, and match content. Each dimension is described below.
Time window	Time window for throttling: Every second, Every minute, Every hour, or Every day.
Limit	Throttling threshold.
Limit unit	Unit of measurement: token, request count, or concurrency.

By consumer

Throttles requests based on consumer identity. Ideal for multi-tenant systems.

Form items: Condition (By consumer) → Matching rule (exact match/prefix match/regex match/any match) → Consumer selection → Time window → Limit + Limit unit
Consumer selection: Select from existing consumers or click Create Consumer to add one. If you select any match, no specific consumer is required.
Example: Limit any consumer to 1,000 tokens per minute.

Important To configure throttling by consumer, you must first enable consumer authentication.

By request header

Throttles requests based on a specified field in the request header.

Form items: Condition (By request header) → Parameter name (Header field name) → Matching rule (exact match/prefix match/regex match/any match) → Match content → Time window → Limit + Limit unit
Parameter name: Required. Enter the header field name to match.
Match content: This field is not required if you select any match.
Example: For requests where the header x-user-level has a value of beta, set a limit of 100 tokens per minute.

By request query

Throttles requests based on a query parameter in the request URL.

Form items: Condition (By request query) → Parameter name (Query parameter name) → Matching rule (exact match/prefix match/regex match/any match) → Match content → Time window → Limit + Limit unit
Parameter name: Required. Enter the query parameter name to match.
Match content: This field is not required if you select any match.
Example: For requests with the query parameter user_id=1, set a limit of 100 tokens per minute.

By request cookie

Throttles requests based on a specified field in the request cookie.

Form items: Condition (By request cookie) → Parameter name (Cookie field name) → Matching rule (exact match/prefix match/regex match/any match) → Match content → Time window → Limit + Limit unit
Parameter name: Required. Enter the cookie field name to match.
Match content: This field is not required if you select any match.
Example: For requests containing a cookie with a specific identifier, set a limit of 100 tokens per minute.

By client IP

Throttles requests based on client IP. Supports both individual IPs and IP ranges.

Form items: Condition (By client IP) → IP address → Limit + Limit unit
IP address: Enter an IP address (for example, 192.168.1.1) or CIDR range (for example, 192.168.1.0/24). To match all client IPs, enter 0.0.0.0/0.
Example: Set the maximum concurrency to 50 for each client IP.

Note Throttling by client IP has a simplified configuration and does not require selecting a matching rule or time window.
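
As a rough illustration of how an IP address or CIDR range can be tested against a client IP, the following Python sketch uses the standard ipaddress module. The function name ip_matches is hypothetical, and the gateway's own matching logic may differ.

```python
import ipaddress

def ip_matches(client_ip: str, configured: str) -> bool:
    """Return True if client_ip falls inside the configured address or CIDR range."""
    # strict=False lets a bare address such as 192.168.1.1 be read as a /32 network.
    network = ipaddress.ip_network(configured, strict=False)
    return ipaddress.ip_address(client_ip) in network

print(ip_matches("192.168.1.42", "192.168.1.0/24"))  # True
print(ip_matches("10.0.0.5", "192.168.1.0/24"))      # False
print(ip_matches("10.0.0.5", "0.0.0.0/0"))           # True: 0.0.0.0/0 covers every IPv4 address
```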

By model

Sets separate throttling thresholds for specific model names. Suitable for services that use multiple models.

Form items: Condition (By model) → Matching rule (fixed to exact match) → Model name → Limit + Limit unit
Model name: Required. Enter the name of the target model to throttle.
Limit unit: Supports token, request count, and concurrency.
Example: For the qwen-max model, set a limit of 500 tokens per minute and a maximum concurrency of 10.

Note Throttling by model always uses an exact match. If you need more flexible model matching, such as prefix or regex matching, use the "By request header" condition and manually specify the parameter name as x-higress-llm-model.
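
To show what a more flexible model match could look like, the sketch below regex-matches the x-higress-llm-model header so that one pattern covers a whole model family. It assumes header names are already lowercased and that the pattern is applied as a search; the helper name matches_model_family and the value other-model are hypothetical, and the gateway's regex dialect is not specified here.

```python
import re

MODEL_HEADER = "x-higress-llm-model"  # header name taken from the note above

def matches_model_family(headers: dict, pattern: str) -> bool:
    """Regex-match the model header, as a 'By request header' rule might (assumption)."""
    value = headers.get(MODEL_HEADER, "")
    return re.search(pattern, value) is not None

# One pattern that covers both qwen-max and qwen-plus.
print(matches_model_family({MODEL_HEADER: "qwen-max"}, r"^qwen-"))     # True
print(matches_model_family({MODEL_HEADER: "qwen-plus"}, r"^qwen-"))    # True
print(matches_model_family({MODEL_HEADER: "other-model"}, r"^qwen-"))  # False
```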

Global limit

The global limit is a fallback policy that applies to the entire API, independent of condition-based rules.

How to enable: In the Global limit section, select the Enable checkbox.
Form items: Time window (Every second/Every minute/Every hour/Every day) → Limit + Limit unit (token/request count/concurrency)
Add multiple rules: Click Add to create multiple global limit rules with different throttling modes.
Example: For the entire API, set a maximum of 10,000 tokens per minute, 100 requests per minute, and a maximum concurrency of 20.

Note The global limit is a fallback policy that supplements the standard rules and can be used with them simultaneously. A request is throttled as soon as any limit that applies to it, standard or global, is exceeded.

Confirm your configurations and click Save.

Matching rules

The conditions By consumer, By request header, By request query, and By request cookie support the following four matching rules. Priority: exact match > prefix match > regex match > any match.

Matching rule	Description	Example
Exact match	The value must be an exact match.	The header `x-user-level` is exactly `beta`.
Prefix match	The value must start with the specified prefix.	The header `x-user-level` starts with `vip`.
Regex match	The value must match the specified regular expression.	The header `x-user-level` matches `^(gold|silver)$`.
Any match	Matches any value for the given dimension. No match content is required.	Applies to any consumer.

Note If you configure multiple rules, a request is throttled if it exceeds the limit of any rule that it matches. The By client IP and By model conditions use fixed matching methods and do not require manual rule selection.
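
The four matching rules and their priority order can be sketched in Python as follows. This is a minimal illustration, not the gateway's implementation: the names rule_matches and matching_rules are hypothetical, and the regex is assumed to be applied as a search over the value.

```python
import re

# Priority as stated above: exact > prefix > regex > any.
PRIORITY = ["exact", "prefix", "regex", "any"]

def rule_matches(kind: str, content, value: str) -> bool:
    """Check one matching rule against a header, query, cookie, or consumer value."""
    if kind == "exact":
        return value == content
    if kind == "prefix":
        return value.startswith(content)
    if kind == "regex":
        return re.search(content, value) is not None
    if kind == "any":
        return True  # no match content is needed
    raise ValueError(f"unknown matching rule: {kind}")

def matching_rules(rules, value: str):
    """Return every rule that matches, ordered from highest to lowest priority."""
    hits = [r for r in rules if rule_matches(r[0], r[1], value)]
    return sorted(hits, key=lambda r: PRIORITY.index(r[0]))

RULES = [
    ("any", None),
    ("regex", r"^(gold|silver)$"),
    ("prefix", "vip"),
    ("exact", "beta"),
]

print(matching_rules(RULES, "vip-gold"))  # [('prefix', 'vip'), ('any', None)]
print(matching_rules(RULES, "beta"))      # [('exact', 'beta'), ('any', None)]
```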

Limit units and modes

For each throttling rule, select a limit unit that corresponds to a different metering method:

Limit unit	Description	Applicable conditions
Token	Measures the total inbound and outbound tokens consumed.	All conditions
Request count	Measures the total number of requests.	All conditions
Concurrency	Measures the number of simultaneous requests.	All conditions
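
As a rough illustration of token metering, the sketch below assumes that a request's cost is the sum of its inbound and outbound tokens, as the table describes, and estimates how many requests of that size fit within one window's limit. The function names and sample numbers are hypothetical.

```python
def request_tokens(inbound: int, outbound: int) -> int:
    """Tokens that one request counts against a token limit (inbound + outbound)."""
    return inbound + outbound

def requests_within_budget(limit: int, tokens_per_request: int) -> int:
    """How many requests of a given size fit entirely within one window's limit."""
    return limit // tokens_per_request

cost = request_tokens(inbound=120, outbound=380)
print(cost)                                # 500
print(requests_within_budget(1000, cost))  # 2
```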

The throttling policy supports the following time windows, used in combination with a limit unit:

Time window	Token throttling	Request count throttling	Concurrency throttling
Every second	Maximum tokens per second.	Maximum requests per second.	—
Every minute	Maximum tokens per minute.	Maximum requests per minute.	—
Every hour	Maximum tokens per hour.	Maximum requests per hour.	—
Every day	Maximum tokens per day.	Maximum requests per day.	—
(No time window)	—	—	Maximum simultaneous requests.

Note Time-window throttling uses a fixed window whose counter has a Time to Live (TTL). Each combination of a throttling dimension and a matched value has its own independent throttling key. The window for a key starts when the first valid data is recorded for that key. It is not aligned to natural clock boundaries (such as 1:00 to 2:00), and it is not a sliding window. Subsequent requests do not extend the expiration time of the current window, and the window resets automatically when it expires. For example, if the first data for a key with an hourly limit is recorded at 01:23:10, its window resets at approximately 02:23:10, not at 02:00:00.
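
The fixed-window behavior described in this note can be sketched in Python as follows. This is a simplified, single-process illustration under stated assumptions: the class FixedWindowCounter, its try_consume method, and the injected clock are hypothetical, and rejecting a request when it would push the count past the limit is an assumption about how the threshold is applied.

```python
import time

class FixedWindowCounter:
    """Per-key fixed window whose timer starts at the first recorded request."""

    def __init__(self, limit: int, window_seconds: int, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._state = {}  # key -> (expires_at, used)

    def try_consume(self, key, amount: int = 1) -> bool:
        now = self.clock()
        expires_at, used = self._state.get(key, (None, 0))
        if expires_at is None or now >= expires_at:
            # First data for this key, or the old window expired: start a new window.
            expires_at, used = now + self.window, 0
        allowed = used + amount <= self.limit
        if allowed:
            used += amount
        # The expiration time is never refreshed by later requests.
        self._state[key] = (expires_at, used)
        return allowed

def hms(h: int, m: int, s: int) -> int:
    """Convert a wall-clock time to seconds for the fake clock below."""
    return h * 3600 + m * 60 + s

now = [hms(1, 23, 10)]
counter = FixedWindowCounter(limit=2, window_seconds=3600, clock=lambda: now[0])
key = ("By request header", "x-user-level", "beta")  # dimension + matched value

print(counter.try_consume(key))  # True: window starts at 01:23:10
now[0] = hms(2, 0, 0)
print(counter.try_consume(key))  # True: 02:00:00 is still in the same window
now[0] = hms(2, 23, 9)
print(counter.try_consume(key))  # False: limit of 2 already used
now[0] = hms(2, 23, 10)
print(counter.try_consume(key))  # True: window reset at 02:23:10
```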

Note Concurrency throttling does not require a time window. You can set the maximum concurrency directly.
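
Because concurrency counts requests that are in flight at the same moment, it needs no window, only an increment when a request starts and a decrement when it finishes. The Python sketch below illustrates the idea for a single process; the class ConcurrencyLimiter is hypothetical and is not how the gateway is implemented.

```python
import threading

class ConcurrencyLimiter:
    """Tracks in-flight requests and rejects new ones above max_concurrency."""

    def __init__(self, max_concurrency: int):
        self._max = max_concurrency
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Call when a request arrives; False means the request is throttled."""
        with self._lock:
            if self._in_flight >= self._max:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Call when an admitted request completes."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1

limiter = ConcurrencyLimiter(max_concurrency=2)
print(limiter.try_acquire())  # True
print(limiter.try_acquire())  # True
print(limiter.try_acquire())  # False: two requests are already in flight
limiter.release()
print(limiter.try_acquire())  # True: a slot was freed
```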

Configuration examples

Example 1: Throttle by consumer tokens and client IP concurrency

Configure two rules: limit any consumer to 1,000 tokens per minute and set the maximum concurrency for each client IP to 50.

No.	Condition	Matching rule	Parameter name/match content	Time window	Limit	Limit unit
1	By consumer	any match	—	Every minute	1000	token
2	By client IP	—	0.0.0.0/0	—	50	concurrency

Example 2: Throttle by request header

Set different throttling rules for different header values:

No.	Condition	Parameter name	Matching rule	Match content	Time window	Limit	Limit unit
1	By request header	x-user-level	exact match	beta	Every minute	100	token
2	By request header	x-user-level	prefix match	vip	Every hour	5000	token
3	By request header	x-app-id	any match	—	Every minute	50	request count

Example 3: Limit by model name

Set different limits for different models: limit qwen-max to 500 tokens per minute and a maximum concurrency of 10, and limit qwen-plus to 2,000 tokens per minute.

No.	Condition	Matching rule	Model name	Time window	Limit	Limit unit
1	By model	exact match	qwen-max	Every minute	500	token
2	By model	exact match	qwen-plus	Every minute	2000	token
3	By model	exact match	qwen-max	—	10	concurrency

Example 4: Combine global and consumer limits

In addition to the consumer-based limit, enable a global limit as a fallback policy:

Throttling policy (Standard rule):

No.	Condition	Matching rule	Time window	Limit	Limit unit
1	By consumer	any match	Every minute	1000	token

Global limit (Enable checkbox selected):

No.	Time window	Limit	Limit unit
1	Every minute	10000	token
2	Every minute	100	request count
3	—	20	concurrency

Note The global limit is a fallback policy that supplements the standard rules. It must be enabled and configured separately.
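
The way the limits in this example combine can be sketched in Python as follows. The Limit class, the violated function, and the sample usage numbers are hypothetical, and treating a limit as violated only when usage is strictly greater than the threshold is an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Limit:
    scope: str      # "consumer" or "global"
    unit: str       # "token", "request count", or "concurrency"
    threshold: int

# The four limits configured in Example 4.
LIMITS = [
    Limit("consumer", "token", 1000),
    Limit("global", "token", 10000),
    Limit("global", "request count", 100),
    Limit("global", "concurrency", 20),
]

def violated(limits, usage):
    """Return the limits whose current usage is above the threshold."""
    return [l for l in limits if usage.get((l.scope, l.unit), 0) > l.threshold]

# Hypothetical usage: one consumer is over its quota while the API as a whole is not.
usage = {
    ("consumer", "token"): 1200,
    ("global", "token"): 4000,
    ("global", "request count"): 40,
    ("global", "concurrency"): 5,
}

hits = violated(LIMITS, usage)
print(bool(hits))                      # True: the request is throttled
print([(l.scope, l.unit) for l in hits])  # [('consumer', 'token')]
```

In this sketch a single violated limit is enough to throttle the request, which mirrors the fallback role of the global limit: it only matters once total traffic across all consumers crosses the API-wide thresholds.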

Example 5: Mixed multi-dimensional rules

Configure throttling rules based on a variety of conditions simultaneously:

No.	Condition	Parameter name	Matching rule	Match content/IP/model name	Time window	Limit	Limit unit
1	By consumer	—	exact match	consumer-001	Every minute	500	token
2	By request header	x-user-level	prefix match	vip	Every hour	10000	token
3	By request query	user_id	regex match	^[0-9]+$	Every minute	200	request count
4	By request cookie	session	any match	—	Every minute	100	request count
5	By client IP	—	—	192.168.1.0/24	—	30	concurrency
6	By model	—	exact match	qwen-max	Every minute	1000	token

FAQ

Q: What is the maximum number of throttling rules I can configure?

A: A request can match up to 10 rules simultaneously. Combine rules across dimensions as needed, but fewer rules yield better performance.

Q: How do multiple rules interact with each other?

A: Rules use OR logic: a request is throttled when it exceeds the limit of any rule that it matches. Rules with the same condition and match key merge into a single rule group.
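
As an illustration of rule groups, the Python sketch below groups the three rules from Example 3 by condition and match key. The tuple layout and the function group_rules are hypothetical; how the gateway represents groups internally is not documented here.

```python
from collections import defaultdict

# (condition, match key, time window, limit, limit unit), taken from Example 3.
RULES = [
    ("By model", "qwen-max", "Every minute", 500, "token"),
    ("By model", "qwen-plus", "Every minute", 2000, "token"),
    ("By model", "qwen-max", None, 10, "concurrency"),
]

def group_rules(rules):
    """Merge rules that share the same condition and match key into one group."""
    groups = defaultdict(list)
    for condition, key, window, limit, unit in rules:
        groups[(condition, key)].append((window, limit, unit))
    return dict(groups)

groups = group_rules(RULES)
print(len(groups))                      # 2
print(groups[("By model", "qwen-max")])
# [('Every minute', 500, 'token'), (None, 10, 'concurrency')]
```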

Q: Can I use a global limit and standard throttling rules at the same time?

A: Yes. The global limit is a fallback that applies to the entire API without distinguishing keys, while standard rules provide fine-grained control per dimension. Throttling triggers if any limit is exceeded.

Q: What is the difference between throttling by model and by request header?

A: Throttling by model automatically converts to an exact match on the x-higress-llm-model header. For prefix or regex matching, use "By request header" with parameter name x-higress-llm-model.

Q: Can I combine the token, request count, and concurrency limit units?

A: Yes. Add multiple rules for the same condition with different limit units (for example, a per-minute token limit and a max concurrency limit for one model). They count independently, and throttling triggers if any limit is reached.

Q: Why are there no options for matching rule and time window when throttling by client IP?

A: Client IP throttling is simplified: enter an IP address or range, and the system handles matching automatically. To apply a limit to all client IPs, enter 0.0.0.0/0.

Q: Does the order of the rules affect throttling behavior?

A: Rules support drag-and-drop sorting. However, because rules use OR logic (a request is throttled when it exceeds the limit of any rule that it matches), the order does not affect throttling behavior.

Q: How long does it take for updated throttling configurations to take effect?

A: Changes typically take effect within seconds after the system pushes the new rules to the gateway data plane.

Q: How accurate is throttling in a distributed architecture?

A: In distributed systems, throttling counts may have slight deviations. The actual allowed requests may differ from the configured limit depending on request volume, rate, and backend latency.