Get started with custom detection agents (AI Guardrails)

AI Guardrails lets you configure and manage custom detection agents. This feature uses a large language model (LLM) and flexible custom prompts to quickly detect and filter content based on your business needs. This topic describes how to use the custom detection agent feature.

Step 1: Activate AI Guardrails

To activate the service, go to the AI Guardrails product page.

Step 2: Enable Customize Detection Agent

Log on to the AI Guardrails console.
In the navigation pane on the left, choose Protection Configuration > Configuration. The following services apply when the LLM uses the text modality for both input and output:
- AI input content moderation (query_security_check)
- AI-generated content moderation (response_security_check)
This topic uses AI input content moderation (query_security_check) as an example. In the Actions column, click Management to go to the Configuration page. If the Customize Detection Agent feature is disabled, enable it on this page. This feature is billed separately. For more information, see Product Billing.
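The two service codes above correspond to the two directions of an LLM conversation. The following is a minimal sketch of how an application might select the code for a given piece of text. The `Direction` enum and the `service_code_for` helper are hypothetical names used for illustration only; the two service code strings are the only parts taken from this topic.

```python
from enum import Enum


class Direction(Enum):
    """Hypothetical label for which side of the conversation a text comes from."""
    INPUT = "input"    # text that a user sends to the LLM
    OUTPUT = "output"  # text that the LLM generates


# Service codes as listed in this topic.
SERVICE_CODES = {
    Direction.INPUT: "query_security_check",
    Direction.OUTPUT: "response_security_check",
}


def service_code_for(direction: Direction) -> str:
    """Return the moderation service code for the given text direction."""
    return SERVICE_CODES[direction]
```

For example, `service_code_for(Direction.INPUT)` returns `query_security_check`, the service used as the example in this topic.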

Step 3: Configure Customize Detection Agent

Go to the Customize Detection Agent page. On the Customize Detection Agent card, click Configuration Management in the lower-right corner.

Select large model: Select an LLM based on your specific moderation requirements. The selected LLM is invoked during the detection process. The available LLMs are:

Model Name	Model Description
Text Moderation LLM	A text moderation LLM that is based on the Qwen foundation model and has undergone supervised fine-tuning for content safety scenarios. It can accurately identify specific compliance and governance-related risks.
Qwen3.0-Plus	A Qwen 3.0 series Plus model. It offers a balance of performance, speed, and cost. This model is suitable for complex scenarios that require high performance and can tolerate some latency.
Qwen3.0-Flash	A Qwen 3.0 series Flash model. It is fast and cost-effective, which makes it suitable for simple tasks.
Qwen3.6-Plus	A Qwen 3.6 series Plus model. It offers a balance of performance, speed, and cost. This model is suitable for complex scenarios that require high performance and can tolerate some latency.
Qwen3.6-Flash	A Qwen 3.6 series Flash model. It is fast and cost-effective, which makes it suitable for simple tasks.

Important

The selected LLM affects billing. Different LLMs have different billing methods. For more information, see Activation and billing overview.

Configure custom prompt:

Select a scenario template: The system provides preset templates for different scenarios. Each template supports different task objectives and detection tags. Currently, the following scenario template is available:
- Custom Tag Template: Supports the configuration of custom detection tags for general scenarios.

Detection configuration: Configure the required detection tags and their corresponding prompts based on your business needs. For each tag that you add, define an Audit Tag and a Description. When you configure multiple detection tags, the LLM performs a multiclass classification task, so describe each tag and its detection criteria accurately and concisely.

Configuration description:
- Audit Tag: Specifies the name of the category to be detected. This is usually a noun phrase.
- Description: Specifies the detection criteria and rules for the LLM. It provides a detailed description of the detection tag's scope. You can provide one to three examples if necessary.

Configuration example:

Audit Tag	Description
Off-site traffic diversion	Behavior that directs users to other platforms or channels outside the site through direct guidance or subtle hints (including variations and metaphors). This includes explicitly mentioning competitor platform names or their variations (such as common competitors like xx), mentioning other external platforms or their variations (such as common platforms like xx), or including explicit contact information.
Malicious negative review of brand xx	Unfounded malicious comparisons, false negative reviews targeting brand xx, or comments or statements that intentionally damage the brand or founder's image, such as false defamation or rumors against the founder. For example: "xx is all false advertising, much worse than brand xx."

Important

The character length of the custom prompt section (the total length of all detection tags and their descriptions) affects billing. The custom section is billed in units of 3,000 characters. If the length is less than 3,000 characters, it is billed as 3,000 characters.
Long prompts also increase detection latency. For this reason, you can configure a maximum of 30 custom detection tags.
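To estimate how the custom prompt length relates to billing before you publish, you can use a rough local calculation such as the following sketch. It rests on two assumptions that this topic does not confirm: that a partial unit beyond the first 3,000 characters is rounded up to a full unit, and that the length is the plain character count of tag names plus descriptions. The function names are hypothetical. See Activation and billing overview for the authoritative rules.

```python
import math
from typing import Mapping

BILLING_UNIT_CHARS = 3000  # billing unit stated in this topic
MAX_TAGS = 30              # maximum number of custom detection tags


def custom_prompt_length(tags: Mapping[str, str]) -> int:
    """Total characters across all tag names and their descriptions."""
    return sum(len(name) + len(description) for name, description in tags.items())


def estimated_billing_units(tags: Mapping[str, str]) -> int:
    """Estimate the number of 3,000-character units for the custom section.

    Assumption: partial units are rounded up. The minimum of one unit
    follows the rule that shorter sections are billed as 3,000 characters.
    """
    if len(tags) > MAX_TAGS:
        raise ValueError(f"At most {MAX_TAGS} custom detection tags are allowed.")
    length = custom_prompt_length(tags)
    return max(1, math.ceil(length / BILLING_UNIT_CHARS))
```

Under these assumptions, a configuration whose tags and descriptions total 3,001 characters would be estimated at two units, so trimming a few characters from a description can matter near a unit boundary.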

Model output format: This is preset and does not require configuration. For more information, see API reference.

During detection, the system combines the selected preset scenario template, the custom detection tag configurations, and the preset output format to create a complete prompt. This prompt is then used to call the selected LLM to get the moderation result. Using the example tags above, the complete combined prompt is as follows:

You are a senior ****** moderation expert, particularly skilled in ******. The business problem you face is ******, and the task objective is ******.
The tags to be moderated are as follows:
1. Off-site traffic diversion: Behavior that directs users to other platforms or channels outside the site through direct guidance or subtle hints (including variations and metaphors). This includes explicitly mentioning competitor platform names or their variations (such as common competitors like xx), mentioning other external platforms or their variations (such as common platforms like xx), or including explicit contact information.   
2. Malicious negative review of brand xx: Unfounded malicious comparisons, false negative reviews targeting brand xx, or comments or statements that intentionally damage the brand or founder's image, such as false defamation or rumors against the founder. For example: "xx is all false advertising, much worse than brand xx."   
3. ******. ******.
Now, you are given a sample to moderate. Please determine if the text matches the scope of the tags above. Strictly follow the format below for the output: ******.
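The way the three parts are combined can be illustrated with the following sketch. It is not the service's actual implementation: the preset scenario text and the preset output format are masked in this topic, so they appear here only as caller-supplied placeholder strings, and `build_prompt` is a hypothetical helper name.

```python
from typing import List, Tuple


def build_prompt(preamble: str, tags: List[Tuple[str, str]], output_format: str) -> str:
    """Combine a scenario preamble, numbered detection tags, and an output format.

    `preamble` and `output_format` stand in for the preset parts that the
    service supplies; their real content is not shown in this topic.
    """
    lines = [preamble, "The tags to be moderated are as follows:"]
    # Number each tag and pair it with its description, as in the example above.
    for index, (name, description) in enumerate(tags, start=1):
        lines.append(f"{index}. {name}: {description}")
    lines.append(
        "Now, you are given a sample to moderate. "
        "Please determine if the text matches the scope of the tags above. "
        f"Strictly follow the format below for the output: {output_format}"
    )
    return "\n".join(lines)


# Illustrative call with placeholder preset parts and shortened descriptions.
example_prompt = build_prompt(
    "<preset scenario preamble>",
    [
        ("Off-site traffic diversion", "Behavior that directs users to other platforms."),
        ("Malicious negative review of brand xx", "Unfounded malicious comparisons."),
    ],
    "<preset output format>",
)
```

Because every tag name and description is copied into the prompt, each additional tag lengthens the text that the LLM must process, which is consistent with the latency note above.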

Step 4: Test the configuration

After you configure the Customize Detection Agent, test it before publishing. Publish the configuration only after the results meet your expectations. Click Test in the lower-left corner of the page to test the configuration. You can test a single text entry or multiple text entries (up to 10).

Note

The test feature on this page is free of charge. Each account can test up to 1,000 text entries per day.
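If you prepare a larger set of test texts offline, you can split it into groups that fit the limits above before pasting them into the console. The following sketch only plans the groups locally and does not call any service. The function name and the `used_today` parameter are hypothetical.

```python
from typing import List, Sequence

MAX_ENTRIES_PER_TEST = 10  # entries per test run, as stated in this topic
DAILY_TEST_QUOTA = 1000    # free test entries per account per day


def plan_test_batches(entries: Sequence[str], used_today: int = 0) -> List[List[str]]:
    """Split test texts into groups of at most 10 entries.

    Raises ValueError if the entries would exceed the remaining daily quota.
    """
    remaining = max(0, DAILY_TEST_QUOTA - used_today)
    if len(entries) > remaining:
        raise ValueError(
            f"{len(entries)} entries exceed the remaining daily quota of {remaining}."
        )
    return [
        list(entries[start:start + MAX_ENTRIES_PER_TEST])
        for start in range(0, len(entries), MAX_ENTRIES_PER_TEST)
    ]
```

For example, 25 test texts are planned as three groups of 10, 10, and 5 entries.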

Step 5: Publish the configuration

When the test results meet your expectations, click Publish to publish the Customize Detection Agent configuration online. After publishing, the configuration typically takes effect in the production environment within 2 to 5 minutes. Because the published configuration applies to production traffic, proceed with caution. After publishing, you can continue to use the online test feature to verify the results.

Step 6: Query results and view threat reports

In the navigation pane on the left, choose Test Results to view the detection results and threat reports for the custom detection agent.