Content Moderation Plus for LLMs

The Text Moderation PLUS service, an upgrade from Text Moderation Enhanced Edition, moderates prompts and generated text from large language models separately, retrieves standard answers for specific prompts, and lets you enable or disable moderation labels. This topic describes how to use the Text Moderation PLUS service.

Important

For content safety in large models, Alibaba Cloud offers its specialized AI Safety product. We recommend using AI Safety for text moderation in AI applications.

Features

Compared to the general text moderation service (Text Moderation Enhanced Edition), the Text Moderation PLUS service offers moderation for large language models to meet compliance and specific business needs.

Use cases

Text Moderation PLUS: Detects content in large language model applications:

  • llm_query_moderation: Moderates prompts for large language models.

  • llm_response_moderation: Moderates text generated by large language models.

General text moderation: Detects AI-generated content (AIGC):

  • ai_art_detection: Moderates text prompts for AIGC, chatbots, and model-generated text.

Moderation capabilities

Text Moderation PLUS:

  • Supports Chinese and English, focusing on identifying politically sensitive content and accurately detecting baseline violations.

  • Also detects content that baits an AI into generating violations, biased or discriminatory content, and private personal information.

  • Retrieves and returns standard answers from a knowledge base for specific model prompts.

  • Lets you customize a replacement answer library and its content based on risk labels.

General text moderation:

  • Supports Chinese and English text, emphasizing detection of baseline violations and negative content.

  • Also detects content that baits an AI into generating violations.

Label system

Text Moderation PLUS:

  • Returns over 30 granular labels across 10 categories, and a confidence score for each label.

  • Supports enabling or disabling labels within the detection scope. You can manage labels in the console.

General text moderation:

  • Returns primary labels for 10 categories.

  • Does not support enabling or disabling labels. The console provides a view-only page for labels.

API features

Text Moderation PLUS: The API operation is TextModerationPlus:

  • The default API limit is 100 QPS.

  • The default input character limits are:

    • 2,000 characters for llm_query_moderation.

    • 5,000 characters for llm_response_moderation.

  • For API parameters, see the API reference.

General text moderation: The API operation is TextModeration.

Risk labels

Label definitions

The Text Moderation PLUS service returns over 30 granular labels across 10 categories, along with a confidence score for each label. For content with multiple risk types, the service returns multiple granular labels. The following table lists the risk label values, their confidence score ranges, and their meanings.

Label value

Confidence score

Description

pornographic_adult

0–100. A higher score indicates higher confidence.

Suspected pornographic content

sexual_terms

0–100. A higher score indicates higher confidence.

Suspected sexual health-related content

sexual_suggestive

0–100. A higher score indicates higher confidence.

Suspected vulgar content

sexual_prompts

0–100. A higher score indicates higher confidence.

Suspected prompts intended to generate pornographic content

political_figure

0–100. A higher score indicates higher confidence.

Suspected political figures

political_entity

0–100. A higher score indicates higher confidence.

Suspected political entities

political_n

0–100. A higher score indicates higher confidence.

Suspected sensitive political content

political_p

0–100. A higher score indicates higher confidence.

Suspected content related to prohibited political figures

political_prompts

0–100. A higher score indicates higher confidence.

Suspected prompts intended to generate political content

political_a

0–100. A higher score indicates higher confidence.

Enhanced protection for political content

violent_extremists

0–100. A higher score indicates higher confidence.

Suspected extremist organizations

violent_incidents

0–100. A higher score indicates higher confidence.

Suspected extremist content

violent_weapons

0–100. A higher score indicates higher confidence.

Suspected weapons and ammunition

violent_prompts

0–100. A higher score indicates higher confidence.

Suspected prompts intended to generate violent content

contraband_drug

0–100. A higher score indicates higher confidence.

Suspected drug-related content

contraband_gambling

0–100. A higher score indicates higher confidence.

Suspected gambling-related content

contraband_act

0–100. A higher score indicates higher confidence.

Suspected contraband activities

contraband_entity

0–100. A higher score indicates higher confidence.

Suspected contraband tools

inappropriate_discrimination

0–100. A higher score indicates higher confidence.

Suspected biased or discriminatory content

inappropriate_ethics

0–100. A higher score indicates higher confidence.

Suspected content with unethical values

inappropriate_profanity

0–100. A higher score indicates higher confidence.

Suspected abusive or insulting content

inappropriate_oral

0–100. A higher score indicates higher confidence.

Suspected vulgar slang

inappropriate_superstition

0–100. A higher score indicates higher confidence.

Suspected superstitious content

inappropriate_nonsense

0–100. A higher score indicates higher confidence.

Suspected spam or nonsensical content

privacy_p

0–100. A higher score indicates higher confidence.

Suspected personal information

privacy_b

0–100. A higher score indicates higher confidence.

Suspected sensitive business data

religion_b

0–100. A higher score indicates higher confidence.

Suspected Buddhism-related content

religion_t

0–100. A higher score indicates higher confidence.

Suspected Taoism-related content

religion_c

0–100. A higher score indicates higher confidence.

Suspected Christianity-related content

religion_i

0–100. A higher score indicates higher confidence.

Suspected Islam-related content

religion_h

0–100. A higher score indicates higher confidence.

Suspected Hinduism-related content

pt_to_sites

0–100. A higher score indicates higher confidence.

Suspected traffic redirection to external sites

pt_by_recruitment

0–100. A higher score indicates higher confidence.

Suspected ads for get-rich-quick schemes or part-time jobs

pt_to_contact

0–100. A higher score indicates higher confidence.

Suspected accounts for ad-based traffic redirection

customized

0–100. A higher score indicates higher confidence.

Matched a custom library

Manage labels

You can enable or disable each risk label in the console. For some risk labels, you can also configure more granular detection scopes. For more information, see the Content Moderation console.

  1. In the left-side navigation bar, select Content Moderation (Enhanced) > text moderation > rule configuration.

  2. On the rule management tab, for a rule such as text moderation for large language model input (llm_query_moderation), click Manage detection rules in the Actions column.

    1. Select the detection type that you want to adjust, such as inappropriate content detection.

    2. Click Edit and modify the detection status.

    3. Click Save. The new configuration takes about 2 to 5 minutes to take effect in the production environment.

On the rule configuration page, the left pane lists detection types, including pornographic content detection, political content detection, violent and terrorist content detection, contraband content detection, inappropriate content detection, privacy-related content detection, religious content detection, and ad content detection. After you select a detection type, the table on the right displays the label values, meanings, granular detection scopes, and detection status (enabled/disabled) for that type. Use the toggle switch to enable or disable detection for a specific label.

Answer library management

You can manage answer libraries on the console. For more information, see the Content Moderation console.

  1. In the left-side navigation pane, choose Enhanced API moderation > text moderation > library management.

  2. On the Answer Library Management tab, you can add or modify answer libraries and their answers.

    1. Click Create Answer Library and enter a name for the library. You can choose to Add Answers in Batches or Upload File. Alternatively, you can select Create library first and add answers later. You can add up to 10,000 answers and create a maximum of 100 answer libraries per account. Each answer cannot exceed 1,000 characters. To add answers in batches, enter one answer per line. You can upload Excel files.

    2. In the list of answer libraries, click manage in the Actions column to open the answer maintenance page.

    3. Click Add, where you can Add answers in batches.

    4. You can add, delete, or modify the library's answers. Use the search box to find specific answers. You can also select multiple answers and click Delete in Batches.

Custom answer library configuration

You can configure answer libraries based on labels on the console. For more information, see the Content Moderation console.

  1. In the left-side navigation pane, choose Enhanced API moderation > text moderation > rule configuration.

  2. On the Rule Management tab, for the LLM input moderation (llm_query_moderation) scenario, click Manage Detection Rules in the Actions column.

    1. Select the detection type that you want to adjust, such as ad content detection.

    2. Click Edit, then modify the settings for Custom answer library configuration.

    3. In the Answer library selection column, select an existing answer library or click Add answer library to create a new one. You can associate up to three answer libraries with each label.

    4. Click Save to save the new custom answer library configuration. The new configuration takes effect in about one minute and applies to the production environment. You can view the answer library currently assigned to each label. In the Custom answer library configuration section, the table shows each label and its meaning. Use the drop-down list on the right to select an answer library.

Billing

The Text Moderation PLUS service supports pay-as-you-go and resource package deduction.

Pay-as-you-go

After you enable the Text Moderation PLUS service, the default billing method is pay-as-you-go. You are billed daily based on your actual usage. You are not charged for the service if you do not use it.

Type

Service scenarios

Unit price

advanced text moderation (text_advanced)

  • LLM input text moderation: llm_query_moderation

  • LLM response text moderation: llm_response_moderation

CNY 15.00 per 10,000 calls
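As a rough illustration of the unit price above, the following sketch estimates a daily pay-as-you-go charge. It assumes the charge scales proportionally with the number of billable calls rather than being rounded up to whole blocks of 10,000; the source does not state the rounding rule, and the function name `estimate_daily_cost` is hypothetical.

```python
from decimal import Decimal

# Unit price from the table above: CNY 15.00 per 10,000 calls.
PRICE_PER_10K_CALLS = Decimal("15.00")


def estimate_daily_cost(billable_calls: int) -> Decimal:
    """Estimate the pay-as-you-go charge in CNY for one day of usage.

    Assumption: billing is proportional to the call count. The actual
    rounding rule is not documented in this topic.
    """
    if billable_calls < 0:
        raise ValueError("billable_calls must be non-negative")
    cost = PRICE_PER_10K_CALLS * billable_calls / Decimal(10000)
    return cost.quantize(Decimal("0.01"))


print(estimate_daily_cost(250_000))  # 375.00
```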

Resource package deduction

For high-volume or predictable usage, we recommend purchasing a resource package in advance. Larger resource packages come with greater discounts. You can purchase and use multiple resource packages simultaneously. For more information, see Purchase a resource package for Text Moderation Plus.

This resource package covers Text Moderation PLUS usage and cannot be shared with the Content Moderation 1.0 traffic package. The deduction factors are as follows:

Type

Service scenarios

Deduction factor

advanced text moderation (text_advanced)

  • LLM input text moderation: llm_query_moderation

  • LLM response text moderation: llm_response_moderation

The deduction factor is 2. This means that each successful API call deducts 2 calls from your resource package quota.

For example, if you purchase a resource package with a quota of 10 calls and you make one successful API call, the system deducts 2 calls from your package, leaving a remaining quota of 8 calls.
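The deduction arithmetic above can be sketched as follows. The helper names are hypothetical, and clamping the remaining quota at zero is an assumption; this topic does not describe what happens once a package is exhausted.

```python
# Deduction factor from the table above: each successful call uses 2 quota units.
DEDUCTION_FACTOR = 2


def remaining_quota(package_quota: int, successful_calls: int) -> int:
    """Return the quota left after the given number of successful calls.

    Assumption: the result is clamped at zero once the package runs out.
    """
    used = successful_calls * DEDUCTION_FACTOR
    return max(package_quota - used, 0)


def calls_covered(package_quota: int) -> int:
    """Return how many successful calls a package can fully cover."""
    return package_quota // DEDUCTION_FACTOR


print(remaining_quota(10, 1))  # 8, matching the example above
print(calls_covered(10))       # 5
```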

Integration

Step 1: Activate the service

Visit Activate Service to activate the Text Moderation Enhanced Edition service.

Step 2: Grant permissions to a RAM user

Before using the SDK or an API, you must grant permissions to a RAM user. API calls require an AccessKey for authentication, which you can create for your Alibaba Cloud account or a RAM user. For instructions on creating an AccessKey, see Obtain an AccessKey.

Procedure

  1. Log in to the RAM console with your Alibaba Cloud account.

  2. Create a RAM user.

    For detailed instructions, see Create a RAM user.

  3. Attach the AliyunYundunGreenWebFullAccess system policy to the RAM user to grant permissions.

    For detailed instructions, see Manage RAM user permissions.

    The RAM user can now call Content Moderation APIs.

Step 3: Install and integrate the SDK

The Content Moderation service is available in the following regions. For the SDK integration guide, see Integration Guide.

Region

Public endpoint

VPC endpoint

China (Beijing)

green-cip.cn-beijing.aliyuncs.com

green-cip-vpc.cn-beijing.aliyuncs.com

China (Shanghai)

green-cip.cn-shanghai.aliyuncs.com

green-cip-vpc.cn-shanghai.aliyuncs.com

China (Hangzhou)

green-cip.cn-hangzhou.aliyuncs.com

green-cip-vpc.cn-hangzhou.aliyuncs.com

China (Shenzhen)

green-cip.cn-shenzhen.aliyuncs.com

green-cip-vpc.cn-shenzhen.aliyuncs.com

China (Chengdu)

green-cip.cn-chengdu.aliyuncs.com

N/A
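The endpoint table above can be captured as a small lookup. This is only a sketch: the region IDs used as keys (for example `cn-shanghai`) are inferred from the host names, and `endpoint_for` is a hypothetical helper, not part of any SDK.

```python
# (public host, VPC host) per region, copied from the table above.
# None marks a region for which the table lists no VPC endpoint.
ENDPOINTS = {
    "cn-beijing": ("green-cip.cn-beijing.aliyuncs.com", "green-cip-vpc.cn-beijing.aliyuncs.com"),
    "cn-shanghai": ("green-cip.cn-shanghai.aliyuncs.com", "green-cip-vpc.cn-shanghai.aliyuncs.com"),
    "cn-hangzhou": ("green-cip.cn-hangzhou.aliyuncs.com", "green-cip-vpc.cn-hangzhou.aliyuncs.com"),
    "cn-shenzhen": ("green-cip.cn-shenzhen.aliyuncs.com", "green-cip-vpc.cn-shenzhen.aliyuncs.com"),
    "cn-chengdu": ("green-cip.cn-chengdu.aliyuncs.com", None),
}


def endpoint_for(region_id: str, use_vpc: bool = False) -> str:
    """Return the HTTPS endpoint URL for a region, or raise ValueError."""
    if region_id not in ENDPOINTS:
        raise ValueError(f"unsupported region: {region_id}")
    public_host, vpc_host = ENDPOINTS[region_id]
    host = vpc_host if use_vpc else public_host
    if host is None:
        raise ValueError(f"no VPC endpoint is listed for {region_id}")
    return f"https://{host}"


print(endpoint_for("cn-shanghai"))
```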

API

Usage

Call this API to create a text content detection task. To construct an HTTP request, see HTTPS native call. Alternatively, use a prebuilt request as described in getting started.

You can run this API directly in OpenAPI Explorer without the hassle of calculating a signature. After a successful API call, OpenAPI Explorer automatically generates an SDK code sample.

  • API: TextModerationPlus

  • Available regions and endpoints:

Region

Public endpoint

VPC endpoint

China (Shanghai)

https://green-cip.cn-shanghai.aliyuncs.com

https://green-cip-vpc.cn-shanghai.aliyuncs.com

China (Beijing)

https://green-cip.cn-beijing.aliyuncs.com

https://green-cip-vpc.cn-beijing.aliyuncs.com

China (Hangzhou)

https://green-cip.cn-hangzhou.aliyuncs.com

https://green-cip-vpc.cn-hangzhou.aliyuncs.com

China (Shenzhen)

https://green-cip.cn-shenzhen.aliyuncs.com

https://green-cip-vpc.cn-shenzhen.aliyuncs.com

China (Chengdu)

https://green-cip.cn-chengdu.aliyuncs.com

N/A

  • Billing: This is a billable API. Metering and billing apply only to requests that return an HTTP status code of 200. Requests that return any other code are not charged. For more information on billing, see the billing overview.

QPS limit

The per-user QPS limit for this API is 100 QPS. If this limit is exceeded, subsequent API calls will be throttled, which may impact your business. Please manage your call rate accordingly.
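One way to stay under the limit is to pace calls on the client side. The sketch below spaces calls at least 1/max_qps seconds apart. It is a single-threaded illustration with a hypothetical class name, not a feature of the service or its SDK; the clock and sleep functions are injectable so the behavior can be checked without real waiting.

```python
import time
from typing import Callable


class CallPacer:
    """Spaces calls at least 1/max_qps seconds apart (single-threaded sketch)."""

    def __init__(
        self,
        max_qps: float = 100.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if max_qps <= 0:
            raise ValueError("max_qps must be positive")
        self._interval = 1.0 / max_qps
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0  # earliest time the next call may start

    def wait(self) -> float:
        """Block until the next call is allowed and return the seconds waited."""
        now = self._clock()
        delay = max(self._next_allowed - now, 0.0)
        if delay > 0:
            self._sleep(delay)
        # Schedule the following call one interval after this one starts.
        self._next_allowed = max(now, self._next_allowed) + self._interval
        return delay
```

Call `pacer.wait()` immediately before each request. A multi-threaded or multi-process client would need a shared limiter instead, because the limit applies per user rather than per thread.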

Request parameters

Parameter

Type

Required

Example value

Description

Service

String

Yes

llm_query_moderation

  • Moderation for large language model queries: llm_query_moderation

  • Moderation for large language model responses: llm_response_moderation

ServiceParameters

JSON string

Yes

The parameter set for the specified moderation service. For field descriptions, see ServiceParameters.

Table 1. ServiceParameters

Parameter

Type

Required

Example

Description

content

String

Yes

Text to moderate

The text to be moderated. The character limits for this content are as follows:

  • The llm_query_moderation service has a limit of 2,000 characters.

  • The llm_response_moderation service has a limit of 5,000 characters.

accountId

String

No

13****

The unique ID of the account. The text moderation engine uses this ID to consider context from previous requests with the same account ID.

Note

Recommended for the llm_query_moderation service.

sessionId

String

No

14****

The ID of the streaming session. The text moderation engine concatenates text segments from the same session and moderates the combined content. The combined content cannot exceed the service's character limit.

Note

Recommended for the llm_response_moderation service. The sessionId and accountId parameters are mutually exclusive.

dataId

String

No

text0424****

A unique identifier for your business data.

The ID must be 64 characters or less and can contain letters, digits, underscores (_), hyphens (-), and periods (.).
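The request parameters above can be assembled and checked before sending. The following is a minimal sketch, not SDK code: `build_request` is a hypothetical helper, characters are counted with Python's `len()` (an assumption about how the service counts them), and ServiceParameters is serialized to a JSON string to match the type given in the table.

```python
import json
import re
from typing import Optional

# Character limits per service, from the content parameter description.
CONTENT_LIMITS = {
    "llm_query_moderation": 2000,
    "llm_response_moderation": 5000,
}
# dataId: up to 64 letters, digits, underscores, hyphens, or periods.
DATA_ID_PATTERN = re.compile(r"[A-Za-z0-9_.-]{1,64}")


def build_request(
    service: str,
    content: str,
    account_id: Optional[str] = None,
    session_id: Optional[str] = None,
    data_id: Optional[str] = None,
) -> dict:
    """Validate the inputs and return the request body as a dict."""
    limit = CONTENT_LIMITS.get(service)
    if limit is None:
        raise ValueError(f"unknown service: {service}")
    if not content:
        raise ValueError("content is required")
    if len(content) > limit:
        raise ValueError(f"content exceeds {limit} characters for {service}")
    params = {"content": content}
    if account_id is not None:
        params["accountId"] = account_id
    if session_id is not None:
        params["sessionId"] = session_id
    if data_id is not None:
        if not DATA_ID_PATTERN.fullmatch(data_id):
            raise ValueError("dataId has invalid characters or is too long")
        params["dataId"] = data_id
    return {
        "Service": service,
        "ServiceParameters": json.dumps(params, ensure_ascii=False),
    }


print(build_request("llm_query_moderation", "testing content"))
```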

Return parameters

Parameter

Type

Value

Description

Code

Integer

200

The status code. For details, see Code Description.

Data

JSONObject

{"Result":[...],"Advice":[...]}

The moderation result data. For details, see the Data parameter.

Message

String

OK

The response message.

RequestId

String

AAAAAA-BBBB-CCCCC-DDDD-EEEEEEEE****

The request ID.

Table 2. Data

Parameter

Type

Example value

Description

Result

JSONArray

[{"Confidence":100.0,"Label":"political_entity","RiskWords":"sensitive_word_1"},{...}]

Detection results, including risk labels and confidence scores. For details, see Result.

RiskLevel

String

high

The risk level, determined by the risk score thresholds you configure. Valid values:

  • high: High risk

  • medium: Medium risk

  • low: Low risk

  • none: No risk detected

Note

Handle high-risk content immediately. Review medium-risk content manually. Address low-risk content only if you need a high recall rate; otherwise, treat it as risk-free content. You can configure the risk score thresholds in the Content Moderation Console.

Advice

JSONArray

[{"Answer":"This is a standard answer"}]

The llm_query_moderation service returns a standard answer if the input prompt matches a query in a configured knowledge base. For details, see Advice.

DataId

String

text0424****

The data ID of the object to moderate.

Note

If you pass the dataId parameter in the request, this field returns the same value.

Table 3. Result

Parameter

Type

Value

Description

Label

String

political_xxx

The label returned by text analysis. Multiple labels and scores may be detected. For a list of supported labels, see risk labels.

Confidence

Float

81.22

The confidence score, which ranges from 0 to 100, with a precision of up to two decimal places. Some labels do not have a confidence score.

RiskWords

String

AA,BB,CC

A comma-separated list of sensitive words detected in the content. This field may not be returned for all labels.

CustomizedHit

JSONArray

[{"LibName":"...","Keywords":"..."}]

If content matches a term in a custom library, the Label field is set to "customized". This field contains the name of the custom library and the matched custom terms. For more information, see CustomizedHit.

Table 4. CustomizedHit

Parameter

Type

Example

Description

LibName

String

custom library 1

The custom library name.

Keywords

String

custom keyword 1, custom keyword 2

A comma-separated list of custom keywords.

Table 5. Advice

Parameter

Type

Value

Description

Answer

String

This is a standard answer.

The moderation service returns an alternative response in the following scenarios:

  • If the input prompt matches an entry in a knowledge base, the system returns one or more standard answers.

Note

This applies only to the llm_query_moderation service.

  • If the content triggers a risk label and matches an entry in a user-defined alternative response library, the system returns a random custom answer.

  • If the content triggers a risk label and matches an entry in the system alternative response library, the system returns a random default answer.

HitLabel

String

political_xxx

The highest-risk label returned by text content moderation. For a list of supported labels, see risk labels.

HitLibName

String

Custom Alternative Response Library 001

The name of the custom alternative response library.

Example

Request example

{
    "Service": "llm_query_moderation",
    "ServiceParameters": {
        "content": "testing content"
    }
}
  • Response when no risks are detected.

{
    "Code": 200,
    "Data": {
        "Result": [
            {
                "Label": "nonLabel"
            }
        ],
        "RiskLevel": "none"
    },
    "Message": "OK",
    "RequestId": "AAAAAA-BBBB-CCCCC-DDDD-EEEEEEEE****"
}
  • Response when the prompt matches the mandatory and alternative response library.

{
    "Code": 200,
    "Data": {
        "Advice": [
            {
                "Answer": "This is a sample standard answer."
            }
        ],
        "Result": [
            {
                "Label": "political_entity",
                "Confidence": 100.0,
                "RiskWords": "word_A,word_B,word_C"
            },
            {
                "Label": "political_figure",
                "Confidence": 100.0,
                "RiskWords": "word_A,word_B,word_C"
            }
        ],
        "RiskLevel": "high"
    },
    "Message": "OK",
    "RequestId": "AAAAAA-BBBB-CCCCC-DDDD-EEEEEEEE****"
}
  • Response when the prompt matches a user-defined rejection and alternative response library.

{
    "Code": 200,
    "Data": {
        "Advice": [
            {
                "HitLabel": "political_entity",
                "Answer": "This is a sample standard answer.",
                "HitLibName": "political_entity-001"
            }
        ],
        "Result": [
            {
                "Label": "political_entity",
                "Confidence": 100.0,
                "RiskWords": "word_A,word_B,word_C"
            },
            {
                "Label": "political_figure",
                "Confidence": 100.0,
                "RiskWords": "word_A,word_B,word_C"
            }
        ],
        "RiskLevel": "high"
    },
    "Message": "OK",
    "RequestId": "AAAAAA-BBBB-CCCCC-DDDD-EEEEEEEE****"
}
  • Response when the prompt matches the system rejection and alternative response library.

{
    "Code": 200,
    "Data": {
        "Advice": [
            {
                "HitLabel": "political_entity",
                "Answer": "This is a sample standard answer."
            }
        ],
        "Result": [
            {
                "Label": "political_entity",
                "Confidence": 100.0,
                "RiskWords": "word_A,word_B,word_C"
            },
            {
                "Label": "political_figure",
                "Confidence": 100.0,
                "RiskWords": "word_A,word_B,word_C"
            }
        ],
        "RiskLevel": "high"
    },
    "Message": "OK",
    "RequestId": "AAAAAA-BBBB-CCCCC-DDDD-EEEEEEEE****"
}
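A response shaped like the examples above could be reduced to a handling decision as sketched below. The action names (`block`, `review`, `pass`) and the function `summarize_response` are hypothetical; they loosely follow the RiskLevel note (handle high risk immediately, review medium risk manually, treat low risk as risk-free unless high recall is needed).

```python
from typing import Any, Dict

# Hypothetical mapping from RiskLevel to an application-side action.
ACTIONS = {"high": "block", "medium": "review", "low": "pass", "none": "pass"}


def summarize_response(response: Dict[str, Any]) -> Dict[str, Any]:
    """Extract the risk level, labels, first answer, and a suggested action."""
    if response.get("Code") != 200:
        raise RuntimeError(f"moderation call failed: {response.get('Message')}")
    data = response.get("Data") or {}
    risk_level = data.get("RiskLevel", "none")
    # "nonLabel" marks a result with no detected risk, so it is skipped.
    labels = [
        item["Label"]
        for item in data.get("Result", [])
        if item.get("Label") not in (None, "nonLabel")
    ]
    advice = data.get("Advice") or []
    answer = advice[0].get("Answer") if advice else None
    return {
        "risk_level": risk_level,
        "labels": labels,
        "answer": answer,
        # Unknown levels fall back to manual review (an assumption).
        "action": ACTIONS.get(risk_level, "review"),
    }
```

When `answer` is not None, an application might return it to the end user in place of the model output; whether to do so is a product decision outside the scope of this sketch.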

Code

Code

Status code

Description

200

OK

The request succeeded.

400

BAD_REQUEST

The request is invalid because a request parameter is incorrect or missing. Check the parameters and retry.

408

PERMISSION_DENY

Permission denied. This can be due to an unauthorized account, overdue payments, an inactive service, or a disabled account.

500

GENERAL_ERROR

An internal server error occurred. This may be a temporary issue. Retry the request. If the issue persists, contact Support.

581

TIMEOUT

The request timed out. Retry the request. If the issue persists, contact Support.

588

EXCEED_QUOTA

The request rate exceeds your quota.
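The codes above suggest different client reactions. The sketch below classifies them, assuming that 500 and 581 are safe to retry (as their descriptions recommend), that 588 calls for backing off before retrying, and that 400 and 408 need a fix on the caller's side. The function names and the exponential backoff schedule are illustrative assumptions, not documented service behavior.

```python
from typing import List

RETRYABLE_CODES = {500, 581}  # descriptions above recommend retrying
THROTTLED_CODES = {588}       # rate exceeded: assumed to need a pause first


def next_step(code: int) -> str:
    """Return 'ok', 'retry', 'backoff', or 'fail' for a response code."""
    if code == 200:
        return "ok"
    if code in RETRYABLE_CODES:
        return "retry"
    if code in THROTTLED_CODES:
        return "backoff"
    # 400, 408, and unknown codes need a fix rather than a retry.
    return "fail"


def backoff_delays(attempts: int, base: float = 0.5, cap: float = 8.0) -> List[float]:
    """Return exponentially growing delays in seconds, capped at `cap`."""
    return [min(base * (2 ** i), cap) for i in range(attempts)]


print(next_step(581))      # retry
print(backoff_delays(5))   # [0.5, 1.0, 2.0, 4.0, 8.0]
```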