Throttling

The throttling policy controls traffic to LLM APIs based on token consumption, request count, and concurrency. Configure rules by consumer, request header, query parameter, cookie, client IP, or model name. A global limit provides API-wide fallback protection against overload and abuse.

Features

  • Prevent resource overload: Limit requests by consumer, request header, query parameter, cookie, client IP, or model name. Combine with a caching policy for better performance.

  • Dynamic traffic control: Apply throttling per second, minute, hour, or day to maintain stability under high concurrency.

  • Multiple matching rules: Supports exact match, prefix match, regex match, and any match for flexible rule targeting.

  • Multiple throttling modes: Throttle by token consumption, request count, or concurrency.

  • Fine-grained model control: Set different thresholds per model name to protect high-cost resources.

  • Global limits: Apply an API-wide fallback limit on total token consumption, request count, and concurrency.

  • Prevent malicious attacks: Throttle specific consumers, headers, query parameters, cookies, or client IPs to block crawlers and automated abuse.

Use cases

  • High-concurrency scenarios: During e-commerce promotions, limit token consumption per time period to prevent malicious high-frequency calls and ensure stability and fairness.

  • AI service calls: Throttle calls to your LLM APIs to prevent service degradation or system crashes caused by traffic bursts.

  • Multi-tenant systems: Assign separate quotas per tenant to ensure fairness and resource isolation.

  • Fine-grained model control: Set per-model limits (for example, different quotas for qwen-max and qwen-plus) to protect high-cost resources.

  • Global traffic protection: Cap total token consumption, request count, and concurrency across the entire API.

  • Malicious attack protection: Defend against crawlers, DDoS attacks, and API abuse.

Procedure

  1. Go to the Instances page in the AI Gateway Console. In the top menu bar, select the region where your target instance is located, and then click the target instance ID.

  2. In the left-side navigation pane, click Model API, and then click the target API Name to go to the API Details page.

  3. Click Policies and Plug-ins, then turn on the Throttling switch and configure the related parameters.

Note A request can match a maximum of 10 rules simultaneously.

Throttling policy

Each rule consists of a Condition, Throttling rule, Time window, Limit, and Limit unit. Drag rules to reorder them. Click Add to create a new rule.

| Parameter | Description |
| --- | --- |
| Throttling | Turns the throttling policy on or off. Off by default. |
| Condition | Dimension for throttling. Options: By consumer, By request header, By request query, By request cookie, By client IP, and By model. |
| Throttling rules | Depending on the selected condition, enter the matching rule, parameter name, and match content. Each dimension is described below. |
| Time window | Time window for throttling: Every second, Every minute, Every hour, or Every day. |
| Limit | Throttling threshold. |
| Limit unit | Unit of measurement: token, request count, or concurrency. |

By consumer

Throttles requests based on consumer identity. Ideal for multi-tenant systems.

  • Form items: Condition (By consumer) → Matching rule (exact match/prefix match/regex match/any match) → Consumer selection → Time window → Limit + Limit unit

  • Consumer selection: Select from existing consumers or click Create Consumer to add one. If you select any match, no specific consumer is required.

  • Example: Limit any consumer to 1,000 tokens per minute.

Important To configure throttling by consumer, you must first enable consumer authentication.

By request header

Throttles requests based on a specified field in the request header.

  • Form items: Condition (By request header) → Parameter name (Header field name) → Matching rule (exact match/prefix match/regex match/any match) → Match content → Time window → Limit + Limit unit

  • Parameter name: Required. Enter the header field name to match.

  • Match content: This field is not required if you select any match.

  • Example: For requests where the header x-user-level has a value of beta, set a limit of 100 tokens per minute.

By request query

Throttles requests based on a query parameter in the request URL.

  • Form items: Condition (By request query) → Parameter name (Query parameter name) → Matching rule (exact match/prefix match/regex match/any match) → Match content → Time window → Limit + Limit unit

  • Parameter name: Required. Enter the query parameter name to match.

  • Match content: This field is not required if you select any match.

  • Example: For requests with the query parameter user_id=1, set a limit of 100 tokens per minute.

By request cookie

Throttles requests based on a specified field in the request cookie.

  • Form items: Condition (By request cookie) → Parameter name (Cookie field name) → Matching rule (exact match/prefix match/regex match/any match) → Match content → Time window → Limit + Limit unit

  • Parameter name: Required. Enter the cookie field name to match.

  • Match content: This field is not required if you select any match.

  • Example: For requests containing a cookie with a specific identifier, set a limit of 100 tokens per minute.

By client IP

Throttles requests based on client IP. Supports both individual IPs and IP ranges.

  • Form items: Condition (By client IP) → IP address → Limit + Limit unit

  • IP address: Enter an IP address (for example, 192.168.1.1) or CIDR range (for example, 192.168.1.0/24). To match all client IPs, enter 0.0.0.0/0.

  • Example: Set the maximum concurrency to 50 for each client IP.

Note Throttling by client IP has a simplified configuration and does not require selecting a matching rule or time window.
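
To illustrate how a single IP address, a CIDR range, and 0.0.0.0/0 relate to each other, here is a minimal sketch using Python's standard ipaddress module. The function name ip_matches is hypothetical, and the sketch only shows the address arithmetic; it is not the gateway's implementation.

```python
import ipaddress


def ip_matches(client_ip: str, rule: str) -> bool:
    """Return True if client_ip falls inside rule (a single IP or a CIDR range)."""
    # A bare address such as 192.168.1.1 parses as a one-address (/32) network.
    network = ipaddress.ip_network(rule, strict=False)
    address = ipaddress.ip_address(client_ip)
    # An IPv6 client never matches an IPv4 rule, and vice versa.
    return address.version == network.version and address in network


print(ip_matches("192.168.1.77", "192.168.1.0/24"))  # inside the /24 range
print(ip_matches("192.168.2.1", "192.168.1.0/24"))   # outside the /24 range
print(ip_matches("203.0.113.9", "0.0.0.0/0"))        # 0.0.0.0/0 covers every IPv4 address
```

The address 203.0.113.9 is only a documentation example address. Note that 0.0.0.0/0 covers IPv4 only in this sketch.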

By model

Sets separate throttling thresholds for specific model names. Suitable for services that use multiple models.

  • Form items: Condition (By model) → Matching rule (fixed to exact match) → Model name → Time window → Limit + Limit unit

  • Model name: Required. Enter the name of the target model to throttle.

  • Limit unit: Supports token, request count, and concurrency.

  • Example: For the qwen-max model, set a limit of 500 tokens per minute and a maximum concurrency of 10.

Note Throttling by model always uses an exact match. If you need more flexible model matching, such as prefix or regex matching, use the "By request header" condition and manually specify the parameter name as x-higress-llm-model.

Global limit

The global limit is a fallback policy that applies to the entire API, independent of condition-based rules.

  • How to enable: In the Global limit section, select the Enable checkbox.

  • Form items: Time window (Every second/Every minute/Every hour/Every day) → Limit + Limit unit (token/request count/concurrency)

  • Add multiple rules: Click Add to create multiple global limit rules with different throttling modes.

  • Example: For the entire API, set a maximum of 10,000 tokens per minute, 100 requests per minute, and a maximum concurrency of 20.

Note The global limit is a fallback policy that supplements the standard rules and can be used together with them. A request is throttled if it exceeds any limit that applies to it.

  4. Confirm your configurations and click Save.

Matching rules

The conditions By consumer, By request header, By request query, and By request cookie support the following four matching rules. Priority: exact match > prefix match > regex match > any match.

| Matching rule | Description | Example |
| --- | --- | --- |
| Exact match | The value must be an exact match. | The header x-user-level is exactly beta. |
| Prefix match | The value must start with the specified prefix. | The header x-user-level starts with vip. |
| Regex match | The value must match the specified regular expression. | The header x-user-level matches ^(gold\|silver)$. |
| Any match | Matches any value for the given dimension. No match content is required. | Applies to any consumer. |

Note If you configure multiple rules, a request is throttled if it exceeds the limit of any rule that it matches. The By client IP and By model conditions use fixed matching methods and do not require manual rule selection.
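
The four matching rules can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the gateway's code: the names rule_matches and matching_rules are hypothetical, regex matching is assumed to behave like a search, and the sketch simply lists matching rules in the stated priority order because how the gateway applies that priority is not specified here.

```python
import re
from typing import List, Optional, Tuple

Rule = Tuple[str, Optional[str]]  # (matching rule, match content)

# Lower number = higher priority: exact > prefix > regex > any.
PRIORITY = {"exact": 0, "prefix": 1, "regex": 2, "any": 3}


def rule_matches(kind: str, content: Optional[str], value: str) -> bool:
    """Check one value (for example, a header value) against one matching rule."""
    if kind == "exact":
        return value == content
    if kind == "prefix":
        return content is not None and value.startswith(content)
    if kind == "regex":
        # Assumption: search semantics; anchor the pattern with ^ and $ for a full match.
        return content is not None and re.search(content, value) is not None
    if kind == "any":
        return True  # no match content required
    raise ValueError("unknown matching rule: " + kind)


def matching_rules(rules: List[Rule], value: str) -> List[Rule]:
    """Return every rule that matches value, highest priority first."""
    matched = [r for r in rules if rule_matches(r[0], r[1], value)]
    return sorted(matched, key=lambda r: PRIORITY[r[0]])


RULES: List[Rule] = [
    ("any", None),
    ("regex", r"^(gold|silver)$"),
    ("prefix", "vip"),
    ("exact", "beta"),
]

print(matching_rules(RULES, "beta"))      # [('exact', 'beta'), ('any', None)]
print(matching_rules(RULES, "vip-gold"))  # [('prefix', 'vip'), ('any', None)]
print(matching_rules(RULES, "gold"))      # [('regex', '^(gold|silver)$'), ('any', None)]
```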

Limit units and modes

For each throttling rule, select a limit unit that corresponds to a different metering method:

| Limit unit | Description | Applicable conditions |
| --- | --- | --- |
| Token | Measures the total inbound and outbound tokens consumed. | All conditions |
| Request count | Measures the total number of requests. | All conditions |
| Concurrency | Measures the number of simultaneous requests. | All conditions |

The throttling policy supports the following time windows, used in combination with a limit unit:

| Time window | Token throttling | Request count throttling | Concurrency throttling |
| --- | --- | --- | --- |
| Every second | Maximum tokens per second. | Maximum requests per second. | |
| Every minute | Maximum tokens per minute. | Maximum requests per minute. | |
| Every hour | Maximum tokens per hour. | Maximum requests per hour. | |
| Every day | Maximum tokens per day. | Maximum requests per day. | |
| (No time window) | | | Maximum simultaneous requests. |

Note Time-window throttling uses a fixed window whose counter has a time to live (TTL). Each combination of a throttling dimension and a matched value has its own independent throttling key. The window for a key starts when the first valid data is recorded for that key. It is not aligned to natural time boundaries (such as 1:00 to 2:00) and is not a sliding window. Subsequent requests do not extend the expiration time of the current window; the window resets automatically when it expires. For example, if the first data for a key with an hourly limit is recorded at 01:23:10, the window resets at approximately 02:23:10, not at 02:00:00.
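
The fixed-window behavior described in this note can be sketched as a small in-memory counter. This is a simplified illustration, not the gateway's implementation: the class name FixedWindowCounter, the injectable clock, and the choice to check the cost before counting it are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowCounter:
    """Per-key fixed window that starts at the first record and expires after a TTL."""

    def __init__(self, limit: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        # key -> (window expiry time, usage counted so far in that window)
        self._state: Dict[str, Tuple[float, int]] = {}

    def allow(self, key: str, cost: int = 1) -> bool:
        now = self.clock()
        state = self._state.get(key)
        if state is None or now >= state[0]:
            # First record for this key, or the old window expired:
            # a new window starts now, not at a natural clock boundary.
            state = (now + self.window, 0)
        expires_at, used = state
        if used + cost > self.limit:
            return False
        # The expiry time is kept as-is; later requests do not refresh it.
        self._state[key] = (expires_at, used + cost)
        return True


# Replay the example from the note with a fake clock (seconds since midnight).
now = [1 * 3600 + 23 * 60 + 10]          # 01:23:10
counter = FixedWindowCounter(limit=2, window_seconds=3600, clock=lambda: now[0])
print(counter.allow("consumer-001"), counter.allow("consumer-001"))  # True True
print(counter.allow("consumer-001"))     # False: hourly limit of 2 is used up
now[0] = 2 * 3600                        # 02:00:00, same window still active
print(counter.allow("consumer-001"))     # False
now[0] = 2 * 3600 + 23 * 60 + 10         # 02:23:10, window has expired
print(counter.allow("consumer-001"))     # True
```

The cost parameter stands in for the limit unit: 1 per request for request count throttling, or a token count for token throttling.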

Note Concurrency throttling does not require a time window. You can set the maximum concurrency directly.
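
Concurrency throttling counts requests that are in flight at the same moment, which is why no time window is involved. A minimal sketch, assuming a hypothetical ConcurrencyLimiter class that the caller acquires before forwarding a request and releases when the response completes:

```python
import threading


class ConcurrencyLimiter:
    """Caps the number of simultaneous in-flight requests."""

    def __init__(self, max_in_flight: int) -> None:
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Reserve a slot for a new request; return False if the cap is reached."""
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Free a slot when a request finishes."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1


limiter = ConcurrencyLimiter(max_in_flight=2)
print(limiter.try_acquire(), limiter.try_acquire())  # True True
print(limiter.try_acquire())                         # False: two requests are in flight
limiter.release()                                    # one request finishes
print(limiter.try_acquire())                         # True
```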

Configuration examples

Example 1: Throttle by consumer token and IP concurrency

Configure two rules: limit any consumer to 1,000 tokens per minute and set the maximum concurrency for each client IP to 50.

| No. | Condition | Matching rule | Parameter name/match content | Time window | Limit | Limit unit |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | By consumer | any match | | Every minute | 1000 | token |
| 2 | By client IP | | 0.0.0.0/0 | | 50 | concurrency |

Example 2: Throttle by request header

Set different throttling rules for different header values:

| No. | Condition | Parameter name | Matching rule | Match content | Time window | Limit | Limit unit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | By request header | x-user-level | exact match | beta | Every minute | 100 | token |
| 2 | By request header | x-user-level | prefix match | vip | Every hour | 5000 | token |
| 3 | By request header | x-app-id | any match | | Every minute | 50 | request count |

Example 3: Limit by model name

Set different limits for different models: limit qwen-max to 500 tokens per minute and a maximum concurrency of 10, and limit qwen-plus to 2,000 tokens per minute.

| No. | Condition | Matching rule | Model name | Time window | Limit | Limit unit |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | By model | exact match | qwen-max | Every minute | 500 | token |
| 2 | By model | exact match | qwen-plus | Every minute | 2000 | token |
| 3 | By model | exact match | qwen-max | | 10 | concurrency |

Example 4: Combine global and consumer limits

In addition to the consumer-based limit, enable a global limit as a fallback policy:

Throttling policy (Standard rule):

| No. | Condition | Matching rule | Time window | Limit | Limit unit |
| --- | --- | --- | --- | --- | --- |
| 1 | By consumer | any match | Every minute | 1000 | token |

Global limit (Enable checkbox selected):

| No. | Time window | Limit | Limit unit |
| --- | --- | --- | --- |
| 1 | Every minute | 10000 | token |
| 2 | Every minute | 100 | request count |
| 3 | | 20 | concurrency |

Note The global limit is a fallback policy that supplements the standard rules. It must be enabled and configured separately.

Example 5: Mixed multi-dimensional rules

Configure throttling rules based on a variety of conditions simultaneously:

| No. | Condition | Parameter name | Matching rule | Match content/IP/model name | Time window | Limit | Limit unit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | By consumer | | exact match | consumer-001 | Every minute | 500 | token |
| 2 | By request header | x-user-level | prefix match | vip | Every hour | 10000 | token |
| 3 | By request query | user_id | regex match | ^[0-9]+$ | Every minute | 200 | request count |
| 4 | By request cookie | session | any match | | Every minute | 100 | request count |
| 5 | By client IP | | | 192.168.1.0/24 | | 30 | concurrency |
| 6 | By model | | exact match | qwen-max | Every minute | 1000 | token |

FAQ

Q: What is the maximum number of throttling rules I can configure?

A: A request can match up to 10 rules simultaneously. Combine rules across dimensions as needed, but fewer rules yield better performance.

Q: How do multiple rules interact with each other?

A: Rules use OR logic: a request is throttled as soon as it exceeds the limit of any rule that it matches. Rules with the same condition and match key merge into a single rule group.

Q: Can I use a global limit and standard throttling rules at the same time?

A: Yes. The global limit is a fallback that applies to the entire API without distinguishing keys, while standard rules provide fine-grained control per dimension. Throttling triggers if any limit is exceeded.

Q: What is the difference between throttling by model and by request header?

A: Throttling by model automatically converts to an exact match on the x-higress-llm-model header. For prefix or regex matching, use "By request header" with parameter name x-higress-llm-model.

Q: Can I combine the token, request count, and concurrency limit units?

A: Yes. Add multiple rules for the same condition with different limit units (for example, a per-minute token limit and a max concurrency limit for one model). They count independently, and throttling triggers if any limit is reached.
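
As a rough illustration of limits that count independently, the sketch below checks current usage against each limit and reports the first one that is exceeded. The function name first_exceeded and the dictionary keys are hypothetical, and the sketch treats a limit as violated only when usage goes above it, which is an assumption.

```python
from typing import Dict, Optional


def first_exceeded(usage: Dict[str, int], limits: Dict[str, int]) -> Optional[str]:
    """Return the name of the first limit whose usage is above its threshold, or None."""
    for unit, limit in limits.items():
        # Each limit unit has its own counter; one violation is enough to throttle.
        if usage.get(unit, 0) > limit:
            return unit
    return None


# Limits for one model: 500 tokens per minute and a maximum concurrency of 10.
qwen_max_limits = {"token_per_minute": 500, "concurrency": 10}

print(first_exceeded({"token_per_minute": 320, "concurrency": 11}, qwen_max_limits))  # concurrency
print(first_exceeded({"token_per_minute": 320, "concurrency": 4}, qwen_max_limits))   # None
```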

Q: Why are there no options for matching rule and time window when throttling by client IP?

A: Client IP throttling uses a simplified configuration: enter an IP address or range, and the system handles matching automatically. To apply a limit to all client IPs, enter 0.0.0.0/0.

Q: Does the order of the rules affect throttling behavior?

A: Rules support drag-and-drop sorting. However, since rules use OR logic (throttling triggers if any rule matches), order does not affect behavior.

Q: How long does it take for updated throttling configurations to take effect?

A: Changes typically take effect within seconds after the system pushes the new rules to the gateway data plane.

Q: How accurate is throttling in a distributed architecture?

A: In distributed systems, throttling counts may have slight deviations. The actual allowed requests may differ from the configured limit depending on request volume, rate, and backend latency.