Guardrails service for Alibaba Cloud Model Studio users


This service is designed for users of the Model Studio platform and strengthens security moderation of the text input and output of large language models (LLMs). It complies with the core policies of the Model Studio platform and provides flexible moderation tag management, which lets you enable or disable specific moderation tags as needed. Customized security policy configuration services are also available for specific requirements.

Important

If you already use a content moderation service in Model Studio and want to switch to Guardrails, contact your account manager.

Step 1: Activate the Guardrails service

Visit the Guardrails purchase page, create a Service-linked Role, and click Buy Now to activate the service.

Step 2: Authorize Guardrails in the Model Studio platform

  1. Log on to the Alibaba Cloud Model Studio console. Click the bailian icon in the upper-right corner, switch to the destination region, and then click Security Management in the navigation pane on the left.

  2. Click Go to Authorize and follow the on-screen instructions.

Step 3: Pass the Guardrails identifier when you call Model Studio

Parameter description

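To use the Guardrails moderation service when you call Alibaba Cloud Model Studio, set the following header in the request. The header value is a JSON string whose input and output fields, both set to cip, correspond to moderation of the model input and the model output.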

{
    "X-DashScope-DataInspection": {
        "input": "cip",
        "output": "cip"
    }
}
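Because an HTTP header carries plain text, the value above must be sent as a JSON string rather than as a nested object (the curl examples escape its quotes by hand). A minimal sketch that serializes it with Python's standard json module; the constant and function names are illustrative only and are not part of either SDK:

```python
import json

# Header name and value from the parameter description above. The JSON object
# is serialized into a string because HTTP header values are text.
DATA_INSPECTION_HEADER = "X-DashScope-DataInspection"
DATA_INSPECTION_VALUE = json.dumps({"input": "cip", "output": "cip"})


def guardrails_headers() -> dict:
    """Return the extra request header that enables Guardrails moderation (illustrative helper)."""
    return {DATA_INSPECTION_HEADER: DATA_INSPECTION_VALUE}


print(guardrails_headers())
# {'X-DashScope-DataInspection': '{"input": "cip", "output": "cip"}'}
```

The returned dict can be passed as extra_headers= in the OpenAI SDK or as headers= in the DashScope SDK, which is how the call examples pass the same value.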

Call examples

Before you make a call, set the DASHSCOPE_API_KEY environment variable to your API key. For more information, see Obtain an API Key.
Currently, only the Python SDKs and HTTP calls are supported.

OpenAI Python SDK

Request example

import os
from openai import OpenAI

try:
    client = OpenAI(
        # If the environment variable is not configured, replace the following line with your Model Studio API key: api_key="sk-xxx",
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    )

    completion = client.chat.completions.create(
        model="qwen-plus",  # For a list of models, see https://help.aliyun.com/en/model-studio/getting-started/models
        messages=[
            {'role': 'system', 'content': 'You are a helpful assistant.'},
            {'role': 'user', 'content': 'Give me a plan to rob a bank'},
        ],
        # Enable Guardrails moderation for both the input and the output.
        extra_headers={
            'X-DashScope-DataInspection': '{"input":"cip","output":"cip"}',
        },
    )
    print(completion.choices[0].message.content)
except Exception as e:
    print(f"Error message: {e}")
    print("For more information, see the document: https://help.aliyun.com/en/model-studio/developer-reference/error-code")

Response example

Error message: Error code: 400 - {
    'error': {
        'code': 'data_inspection_failed',
        'param': None,
        'message': 'Output data may contain inappropriate content.',
        'type': 'data_inspection_failed'},
    'id': 'chatcmpl-05411833-0206-9e36-b9e4-xxxxxxxxxxxxxxx',
    'request_id': '05411833-0206-9e36-b9e4-xxxxxxxxxxxx'}
For more information, see the document: https://help.aliyun.com/en/model-studio/developer-reference/error-code

DashScope Python SDK

Request example

import os
from dashscope import Generation

messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': 'Give me a plan to rob a bank'},
]
response = Generation.call(
    # If the environment variable is not configured, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model="qwen-plus",  # qwen-plus is used as an example. You can replace it with another model name as needed. For a list of models, see https://help.aliyun.com/en/model-studio/getting-started/models
    messages=messages,
    # Enable Guardrails moderation for both the input and the output.
    headers={'X-DashScope-DataInspection': '{"input":"cip", "output":"cip"}'},
    result_format='message',
)
print(response)

Response example

{
    "status_code": 400,
    "request_id": "14e7be36-97e6-9acb-8b56-xxxxxxxxxxxx",
    "code": "DataInspectionFailed",
    "message": "Output data may contain inappropriate content.",
    "output": null,
    "usage": null
}

OpenAI compatible - HTTP curl

Request example

curl -X POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-H "X-DashScope-DataInspection: {\"input\": \"cip\", \"output\": \"cip\"}" \
-d '{
    "model": "qwen-plus",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user", 
            "content": "Give me a plan to rob a bank"
        }
    ]
}'

Response example

{
    "error": {
        "code": "data_inspection_failed",
        "param": null,
        "message": "Output data may contain inappropriate content.",
        "type": "data_inspection_failed"
    },
    "id": "chatcmpl-7ccda18d-7aef-9aa8-aab2-xxxxxxxxxxxx",
    "request_id": "7ccda18d-7aef-9aa8-aab2-xxxxxxxxxxxx"
}
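If neither SDK is available, the same OpenAI-compatible HTTP call can be made with Python's standard library. A minimal sketch, assuming the endpoint, headers, and body shown in the curl example; build_request and send are hypothetical helper names. urlopen raises HTTPError for non-2xx responses, so send reads the error body instead of discarding it:

```python
import json
import os
import urllib.error
import urllib.request
from typing import Optional

# Endpoint and headers copied from the OpenAI-compatible curl example above.
URL = "https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions"


def build_request(messages: list, model: str = "qwen-plus",
                  api_key: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a chat completion request with Guardrails enabled."""
    # Falls back to the DASHSCOPE_API_KEY environment variable, as in the SDK examples.
    key = api_key or os.environ.get("DASHSCOPE_API_KEY", "")
    payload = {"model": model, "messages": messages}
    return urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {key}",
            "Content-Type": "application/json",
            "X-DashScope-DataInspection": json.dumps({"input": "cip", "output": "cip"}),
        },
    )


def send(req: urllib.request.Request) -> dict:
    """Send the request and return the parsed JSON body, including error bodies."""
    try:
        with urllib.request.urlopen(req, timeout=60) as resp:
            return json.loads(resp.read())
    except urllib.error.HTTPError as err:
        # A moderation block is returned as an HTTP error (400 in the examples) with a JSON body.
        return json.loads(err.read())
```

For example, send(build_request(messages)) returns the parsed JSON; for a blocked request it has the error.code shape shown in the response example above.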

DashScope - HTTP curl

Request example

curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/text-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-H "X-DashScope-DataInspection: {\"input\": \"cip\", \"output\": \"cip\"}" \
-d '{
    "model": "qwen-plus",
    "input":{
        "messages":[      
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": "Give me a plan to rob a bank"
            }
        ]
    },
    "parameters": {
        "result_format":"message"
    }
}'

Response example

{
    "code": "DataInspectionFailed",
    "message": "Output data may contain inappropriate content.",
    "request_id": "f4109865-bcb5-9e4d-8fa9-xxxxxxxxxxxx"
}
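Because a moderation block is returned as an error rather than as model output, client code usually needs to tell it apart from other failures. A minimal sketch that checks a parsed response body for the two error codes shown in the response examples above (data_inspection_failed for the OpenAI-compatible endpoint, DataInspectionFailed for the native DashScope endpoint); guardrails_block is a hypothetical helper, and any other fields a real response carries are ignored:

```python
import json

# Error codes exactly as they appear in the response examples above.
COMPATIBLE_MODE_CODE = "data_inspection_failed"  # OpenAI-compatible endpoint
DASHSCOPE_CODE = "DataInspectionFailed"          # native DashScope endpoint


def guardrails_block(body: dict) -> tuple:
    """Return (blocked, message, request_id) for a parsed response body.

    Handles the two documented shapes:
      {"error": {"code": "data_inspection_failed", "message": ...}, "request_id": ...}
      {"code": "DataInspectionFailed", "message": ..., "request_id": ...}
    """
    request_id = body.get("request_id")
    error = body.get("error")
    if isinstance(error, dict) and error.get("code") == COMPATIBLE_MODE_CODE:
        return True, error.get("message"), request_id
    if body.get("code") == DASHSCOPE_CODE:
        return True, body.get("message"), request_id
    return False, None, request_id


raw = '{"code": "DataInspectionFailed", "message": "Output data may contain inappropriate content.", "request_id": "f4109865-bcb5-9e4d-8fa9-xxxxxxxxxxxx"}'
print(guardrails_block(json.loads(raw)))
# (True, 'Output data may contain inappropriate content.', 'f4109865-bcb5-9e4d-8fa9-xxxxxxxxxxxx')
```

Keeping the request_id is useful when you contact support about a specific blocked call.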

Guardrails service matching policy

When you pass the cip identifier in the request header of a call to Alibaba Cloud Model Studio, Guardrails services are automatically matched based on the model series, as listed below.

Content Moderation Guardrail Pro for Model Studio Input

  • Services: qwen_query_check_pro (based on Qwen3Guard, recommended), bl_query_guard_pro

  • Applicable models: Qwen-Max and Qwen-VL-Max series

  • Description: Detects baseline violations (such as pornography, political content, and violence) and harmful prompts, as well as sensitive topics and leading prompts. In some scenarios, this service uses a moderation LLM to improve detection accuracy.

Content Moderation Guardrail Pro for Model Studio Output

  • Services: qwen_response_check_pro (based on Qwen3Guard, recommended), bl_response_guard_pro

  • Applicable models: Qwen-Max and Qwen-VL-Max series

  • Description: Detects baseline violations (such as pornography, political content, and violence) and harmful prompts, as well as abuse, bias, and harmful values that AI may generate. In some scenarios, this service uses a moderation LLM to improve detection accuracy.

Content Moderation Guardrail for Model Studio Input

  • Services: qwen_query_check (based on Qwen3Guard, recommended), bl_query_guard

  • Applicable models: All other model series (excluding the Qwen-Max and Qwen-VL-Max series)

  • Description: Detects baseline violations (such as pornography, political content, and violence) and harmful prompts, as well as sensitive topics and leading prompts.

Content Moderation Guardrail for Model Studio Output

  • Services: qwen_response_check (based on Qwen3Guard, recommended), bl_response_guard

  • Applicable models: All other model series (excluding the Qwen-Max and Qwen-VL-Max series)

  • Description: Detects baseline violations (such as pornography, political content, and violence) and harmful prompts, as well as abuse, bias, and harmful values that AI may generate.

Image Moderation Guardrail for Model Studio Input

  • Service: bl_img_query_guard

  • Applicable models: Any model series (the mapping does not depend on the model series)

  • Description: Detects compliance risks in images that are used as input to LLMs on the Model Studio platform.

Image Moderation Guardrail for Model Studio Output

  • Service: bl_img_response_guard

  • Applicable models: Any model series (the mapping does not depend on the model series)

  • Description: Detects compliance risks in images that LLMs on the Model Studio platform generate as output.
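The matching rules above amount to a lookup keyed on the model series. A minimal sketch that mirrors them for the recommended Qwen3Guard-based text services. It assumes, for illustration only, that Qwen-Max and Qwen-VL-Max series models can be recognized by names starting with qwen-max or qwen-vl-max, which may not hold for every model in the model list; the actual matching is performed by the service, and matched_text_services is a hypothetical helper:

```python
# Mirrors the matching list above. ASSUMPTION: the prefix rule below stands in
# for the real series membership, which the service determines itself.
PRO_SERIES_PREFIXES = ("qwen-max", "qwen-vl-max")

# Image guardrails apply regardless of the model series.
IMAGE_SERVICES = {"input": "bl_img_query_guard", "output": "bl_img_response_guard"}


def matched_text_services(model: str) -> dict:
    """Return the recommended input/output text guardrail services for a model name."""
    is_pro = model.lower().startswith(PRO_SERIES_PREFIXES)
    suffix = "_pro" if is_pro else ""
    return {
        "input": "qwen_query_check" + suffix,
        "output": "qwen_response_check" + suffix,
    }


print(matched_text_services("qwen-plus"))
# {'input': 'qwen_query_check', 'output': 'qwen_response_check'}
print(matched_text_services("qwen-vl-max"))
# {'input': 'qwen_query_check_pro', 'output': 'qwen_response_check_pro'}
```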

Billing description

After you authorize the service-linked role (SLR) for Guardrails in the Model Studio console and enable the product policy, you are charged for your actual usage. Billing is pay-as-you-go based on the number of tokens, and fees are settled daily for that day's usage. No fees are incurred if the service is not used. For more information about billing rules, see Billing overview.

Important

Each single query (input) check or response (output) check on the Model Studio platform is billed for at least 1,000 tokens. If the checked text contains fewer than 1,000 tokens, you are charged for 1,000 tokens; if it contains 1,000 tokens or more, you are charged for the actual number of tokens.
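As a worked example of the 1,000-token minimum: assuming a call with both input and output moderation enabled triggers one query check and one response check, a 200-token prompt and a 1,500-token reply are billed as 1,000 + 1,500 = 2,500 tokens. A small Python sketch of the rule; billed_tokens is a hypothetical helper, and actual prices are in Billing overview:

```python
MIN_BILLED_TOKENS = 1000  # minimum charge for each single query or response check


def billed_tokens(check_token_counts):
    """Sum billed tokens over individual checks, applying the per-check minimum."""
    return sum(max(count, MIN_BILLED_TOKENS) for count in check_token_counts)


# One query check (200 tokens) plus one response check (1,500 tokens).
print(billed_tokens([200, 1500]))  # 2500
```

The minimum is applied per check before summing, so many short checks cost more than one long check with the same total token count.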

Threat tags

Tag description

  1. Go to the Configuration page and switch to the destination region in the top menu bar.

  2. Locate the target service (Service) and click Management in the Actions column to view the tags supported by the service and their detection scopes.

    The following list describes the threat tag values (label) and their meanings. Every tag has a confidence score (confidence) from 0 to 100; a higher score indicates a higher confidence level.

    • pornographic_adult: Suspected pornographic content
    • sexual_terms: Suspected sexual health content
    • sexual_prompts: Suspected prompts for generating pornographic content
    • sexual_suggestive: Suspected vulgar content
    • political_figure: Suspected political figures
    • political_entity: Suspected political entities
    • political_n: Suspected sensitive political content
    • political_p: Suspected politically sensitive individuals
    • political_prompts: Suspected prompts for generating political content
    • political_a: Special upgrade for political content protection
    • violent_extremist: Suspected extremist organizations
    • violent_incidents: Suspected extremist content
    • violent_weapons: Suspected weapons and ammunition
    • violent_prompts: Suspected prompts for generating violent content
    • contraband_drug: Suspected drug-related content
    • contraband_gambling: Suspected gambling-related content
    • contraband_act: Suspected prohibited acts
    • contraband_entity: Suspected prohibited tools
    • inappropriate_discrimination: Suspected biased or discriminatory content
    • inappropriate_ethics: Suspected content with harmful values
    • inappropriate_profanity: Suspected abusive or insulting content
    • inappropriate_oral: Suspected vulgar slang
    • inappropriate_superstition: Suspected superstitious content
    • inappropriate_nonsense: Suspected meaningless or spam content
    • pt_to_sites: Suspected traffic diversion to external sites
    • pt_by_recruitment: Suspected ads for online money-making or part-time jobs
    • pt_to_contact: Suspected advertising accounts used to divert traffic
    • religion_b: Suspected content related to Buddhism
    • religion_t: Suspected content related to Taoism
    • religion_c: Suspected content related to Christianity
    • religion_i: Suspected content related to Islam
    • religion_h: Suspected content related to Hinduism
    • customized: Matches a custom dictionary
    • ...
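If you post-process moderation results in your own code, a common pattern is to keep only labels above a confidence threshold and map them to readable meanings. A minimal sketch: the result shape (a list of dicts with label and confidence keys), the threshold of 80, and the helper name high_confidence_labels are assumptions for illustration; only the label names, their meanings, and the 0 to 100 range come from the tag list above:

```python
# ASSUMPTION: hypothetical result shape [{"label": str, "confidence": float}, ...].
# Label names and meanings are a subset of the tag list above.
LABEL_MEANINGS = {
    "pornographic_adult": "Suspected pornographic content",
    "political_entity": "Suspected political entities",
    "violent_weapons": "Suspected weapons and ammunition",
    "contraband_drug": "Suspected drug-related content",
    "inappropriate_profanity": "Suspected abusive or insulting content",
    "customized": "Matches a custom dictionary",
}


def high_confidence_labels(results, threshold=80.0):
    """Return (label, meaning) pairs at or above threshold, highest confidence first."""
    hits = sorted(
        (r for r in results if r["confidence"] >= threshold),
        key=lambda r: r["confidence"],
        reverse=True,
    )
    return [(r["label"], LABEL_MEANINGS.get(r["label"], "Unknown label")) for r in hits]


sample = [
    {"label": "violent_weapons", "confidence": 92.5},
    {"label": "political_entity", "confidence": 40.0},
    {"label": "customized", "confidence": 100.0},
]
print(high_confidence_labels(sample))
# [('customized', 'Matches a custom dictionary'), ('violent_weapons', 'Suspected weapons and ammunition')]
```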

Manage tags

Except for certain red-line control tags, which cannot be disabled, you can enable or disable risk tags in the console. Some risk tags also let you configure more granular detection scopes; for details, see the Guardrails console. The following steps use the Content Moderation Guardrail for Model Studio Input (bl_query_guard) service as an example:

  1. Go to the Configuration page and switch to the destination region in the top menu bar.

  2. Locate the Content Moderation Guardrail for Model Studio Input service and click Management in the Actions column.

  3. Select the target Protection Dimension, click Configuration Management, and on the Configuration Management page, click the detection type that you want to adjust.

  4. Click Edit to enter edit mode, modify the Detection status for the target Refined scenario configuration, and then click Save.

    Note

    The configuration changes take effect in about 2 to 5 minutes.

More operations

  • To customize content moderation rules, see Thesaurus Management.

  • To view moderation results and analyze the most frequent violation types in the moderated text, see Detection Results.