EMR Serverless StarRocks AI Functions

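StarRocks AI Functions embed large language model (LLM) capabilities directly into the OLAP analysis workflow. A single SQL statement can cover data processing, analysis, and AI inference, which significantly improves the efficiency of BI and data analysis.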

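Core advantages

Data stays in the lake, with analysis and inference in one place: Raw data always remains in StarRocks. AI functions send only the fields to be processed, on demand and encrypted, to the external model service. This avoids data copies and cross-system transfers while balancing security compliance with analysis efficiency.
Native SQL, easy to use: With the built-in AI Functions, you use standard SQL to bring LLM capabilities into OLAP analysis, with no new programming language or tool to learn.
Results available immediately: Results returned by AI functions can participate directly in subsequent JOIN, aggregation, and filter operations without a second import, so AI analysis and data queries work together in real time.
High concurrency efficiency: As few as 4 threads can drive hundreds of concurrent LLM calls. A client-side approach needs an equal number of processes, and operating system scheduling often becomes the bottleneck before the workload does.
Zero-maintenance rate limiting: A built-in three-layer protection mechanism, plus compatibility with non-standard Bailian (Alibaba Cloud Model Studio) interfaces, keeps jobs running smoothly without requiring you to understand rate limits such as RPM (requests per minute) and TPM (tokens per minute).
Token cost optimization: Predicate pushdown reduces the number of calls, caching eliminates duplicate requests, and usage is attributed precisely. The cost of the same task can drop by more than 30%.
Industrial-grade reliability: Row-level fault tolerance, intelligent retries, and Profile-based observability allow batch jobs over millions of rows to run without manual intervention.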

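Usage limits

Each user receives a free trial quota of 1 million tokens. Usage beyond the free quota is billed according to actual token consumption. For billing details, see AI Function billing.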

Prerequisites

Kernel version requirements

3.3.20-2.1.1 or later
3.5.16-2.1.1 or later

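Configure network access

Purpose: StarRocks BE nodes must reach external AI model service endpoints over the public network, so you must enable public network access for them.

Configuration:

Configure a NAT gateway for the VPC where the StarRocks cluster resides, and set up an SNAT rule that allows BE nodes to initiate access to the public network.

Create a NAT gateway and bind an elastic IP address (EIP) to it.
Configure an SNAT rule that routes traffic from the BE nodes' CIDR block to the public network through the NAT gateway.
Make sure the relevant VPC route tables and security group policies allow outbound traffic.

For detailed steps, see Internet NAT gateway.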

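Function overview

The table below lists all AI Functions built into StarRocks. They cover four capabilities: text understanding, generation, transformation, and security processing. All of them can be called from native SQL, and their results can participate directly in subsequent computation (JOIN/AGG/FILTER) without an intermediate ETL step.

Function	Description	Typical scenarios
`ai_sentiment(text)`	Sentiment analysis. Returns `positive`, `negative`, `neutral`, `mixed`, or `unknown`.	Operations analytics and business intelligence (BI), for example a dashboard of user review sentiment.
`ai_summarize(text)`	Text summarization.	Data governance, for example automatic summaries of reviews and tickets.
`ai_fix_grammar(text)`	Automatic grammar and spelling correction.	Data governance and quality inspection, for example checking user-generated content (UGC).
`ai_redact(text, categories)`	PII redaction.	Security and compliance, for example redacting sensitive data in development/test environments and in audit log output.
`ai_classify(text, labels)`	Classification by custom labels.	Content and knowledge management, for example tagging product reviews.
`ai_extract(text, entity_labels)`	Extracts entities and returns them as JSON.	Content and knowledge management, for example building a knowledge graph of sales leads.
`ai_translate(text, source_lang, target_lang)`	Machine translation.	Globalization, for example automatic localization of international reports.
`ai_similarity(text1, text2)`	Semantic similarity.	Precise retrieval and RAG, for example ranking FAQ entries by match.
`ai_complete(model, prompt)`	General-purpose AI completion.	General generation and question answering, for example batch generation of marketing copy.
`ai_complete(model, prompt, params)`	AI completion with parameters.	Cases that need `params` to control model behavior (for example temperature and max_tokens).
`ai_custom_query(resource, prompt)`	Query through a custom Resource.	Private model integration, for example a dedicated small model deployed for compliance reasons.
`ai_filter(text, condition)`	AI-based conditional filtering.	Data filtering, for example selecting rows by a semantic condition.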
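
As a rough illustration of how an AI function result can feed ordinary SQL operators, the sketch below counts reviews per product and sentiment. The table `product_reviews` and its columns `product_id` and `review_text` are hypothetical; only `ai_sentiment` comes from the function list.

```sql
-- Hypothetical table: product_reviews(product_id BIGINT, review_text VARCHAR)
SELECT
    product_id,
    sentiment,
    COUNT(*) AS review_count
FROM (
    -- Score every review, then aggregate the scores like any other column
    SELECT product_id, ai_sentiment(review_text) AS sentiment
    FROM product_reviews
) AS scored
GROUP BY product_id, sentiment
ORDER BY product_id, review_count DESC;
```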

Function details

ai_sentiment (sentiment analysis)

Performs sentiment analysis on the input text.

Syntax
```
ai_sentiment(text)
```
Parameters
- text (VARCHAR): The text to analyze.

Return value

Returns a string (VARCHAR) whose value is one of 'positive', 'negative', 'neutral', 'mixed', or 'unknown'.

Example

SELECT ai_sentiment('I am happy');
-- Returns: 'positive'

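ai_classify (classification)

Classifies the input text into one of the labels you provide.

Syntax
```
ai_classify(text, labels)
```
Parameters
- text (VARCHAR): The text to classify.
- labels (ARRAY<VARCHAR>): An array of candidate labels. The array must contain at least 2 and at most 20 elements.

Return value

Returns a JSON object that contains the classification result. Returns NULL if the text cannot be classified.

Example

SELECT ai_classify("My password is leaked.", ["urgent", "not urgent"]);
-- Returns: {"labels": ["urgent"]}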
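
Because the result is a JSON object, a single label can be pulled out with a JSON function. The following is a minimal sketch, assuming the result has the `{"labels": [...]}` shape shown in the example and that StarRocks' `get_json_string` can read it after a cast to VARCHAR. The `support_tickets` table is hypothetical.

```sql
-- Hypothetical table: support_tickets(ticket_id BIGINT, body VARCHAR)
SELECT
    ticket_id,
    -- First label of the {"labels": [...]} result; NULL when classification fails
    get_json_string(
        CAST(ai_classify(body, ['urgent', 'not urgent']) AS VARCHAR),
        '$.labels[0]'
    ) AS urgency
FROM support_tickets;
```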

ai_extract (entity extraction)

Extracts the entities you specify from the text.

Syntax
```
ai_extract(text, entity_labels)
```
Parameters
- text (VARCHAR): The text to extract information from.
- entity_labels (ARRAY<VARCHAR>): An array of entity types to extract, for example ['person', 'location'].

Return value

Returns a JSON object. Each key is an entity type you specified in entity_labels, and each value is the corresponding content extracted from the text.

Example

SELECT ai_extract('John Doe lives in New York and works for Acme Corp.', ['person', 'location', 'organization']);
-- Returns: {"person":"John Doe","location":"New York","organization":"Acme Corp"}
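
To turn the extracted JSON into regular columns, each key can be read with a JSON path. This is a sketch under the assumption that the result keys match the requested labels, as in the example. The `crm_notes` table is hypothetical.

```sql
-- Hypothetical table: crm_notes(note_id BIGINT, note_text VARCHAR)
SELECT
    note_id,
    get_json_string(entities, '$.person')       AS person,
    get_json_string(entities, '$.organization') AS organization
FROM (
    -- Run the extraction once per row and reuse the result for both columns
    SELECT
        note_id,
        CAST(ai_extract(note_text, ['person', 'organization']) AS VARCHAR) AS entities
    FROM crm_notes
) AS extracted;
```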

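ai_fix_grammar (grammar correction)

Corrects grammar and spelling in the input text.

Syntax
```
ai_fix_grammar(text)
```
Parameters
- text (VARCHAR): The text to correct.

Return value

Returns the text as a string (VARCHAR) with grammar and spelling corrected.

Example

SELECT ai_fix_grammar('This sentence have some mistake');
-- Returns: "This sentence has some mistake"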

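ai_complete (AI completion)

General-purpose AI completion, with or without parameters. You specify the model used for content generation and question answering.

Syntax

ai_complete(model, prompt)
ai_complete(model, prompt, params)

Parameters
- model (VARCHAR): The model name. For example, '__system__' refers to the system model.
- prompt (VARCHAR): The prompt that guides the model's output.
- params (MAP, optional): Parameters that control model behavior, such as temperature, max_tokens, and top_p.

Return value

Returns a string (VARCHAR) generated from the prompt.

Example

-- Simple example
SELECT ai_complete('__system__', 'Write a catchy email subject line for a summer bicycle sale with a 20% discount');
-- Returns: "Summer ride carnival: 20% off for a limited time!"

-- Example with parameters
SELECT ai_complete('__system__', 'What is the capital of France?', map{'temperature': '0.1', 'max_tokens': '50'});
-- Returns: "Paris"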
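
For batch generation, the prompt can be assembled per row from table columns. The sketch below assumes a hypothetical `campaigns` table; the parameter values are illustrative only.

```sql
-- Hypothetical table: campaigns(campaign_id BIGINT, product_name VARCHAR, discount VARCHAR)
SELECT
    campaign_id,
    ai_complete(
        '__system__',
        -- Build one prompt per row from the row's own columns
        concat('Write a catchy email subject line for a sale on ', product_name,
               ' with a discount of ', discount),
        map{'temperature': '0.7', 'max_tokens': '50'}
    ) AS subject_line
FROM campaigns;
```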

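ai_filter (AI conditional filtering)

Decides whether a text satisfies a condition expressed in natural language.

Syntax
```
ai_filter(text, condition)
```
Parameters
- text (VARCHAR): The text to evaluate.
- condition (VARCHAR): The filter condition, written in natural language.

Return value

Returns a BOOLEAN that indicates whether the text satisfies the condition.

Example

SELECT ai_filter('This product is poor quality. I do not recommend buying it.', 'negative review');
-- Returns: true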
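
Since the function returns a BOOLEAN, it can sit directly in a WHERE clause. In the sketch below, an ordinary date predicate is combined with it so that fewer rows need a model call. The actual evaluation order is up to the query planner, and the `product_reviews` table is hypothetical.

```sql
-- Hypothetical table: product_reviews(product_id BIGINT, review_text VARCHAR, created_at DATETIME)
SELECT product_id, review_text
FROM product_reviews
WHERE created_at >= '2024-01-01'                   -- cheap predicate to narrow the rows
  AND ai_filter(review_text, 'negative review');   -- semantic condition on the remainder
```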

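ai_redact (PII redaction)

Redacts the specified types of personally identifiable information (PII) in the text.

Syntax
```
ai_redact(text, categories)
```
Parameters
- text (VARCHAR): The text to redact.
- categories (ARRAY<VARCHAR>): An array of entity types to redact, for example ['person', 'email', 'phone'].

Return value

Returns a string (VARCHAR) in which the specified entities have been redacted.

Example

-- Simple example
SELECT ai_redact('John Doe lives in New York. His email is john.doe@example.com.', ['person', 'email']);
-- Returns: "[REDACTED] lives in New York. His email is [REDACTED]."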
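
One way to apply this to a development or test copy is to write the redacted text into a separate table. This is a sketch only: `audit_logs` and `audit_logs_redacted` are hypothetical tables, both assumed to have the columns `log_id` and `message`.

```sql
-- Hypothetical tables: audit_logs(log_id BIGINT, message VARCHAR)
--                      audit_logs_redacted(log_id BIGINT, message VARCHAR)
INSERT INTO audit_logs_redacted
SELECT
    log_id,
    -- Only the redacted text is written to the target table
    ai_redact(message, ['person', 'email', 'phone']) AS message
FROM audit_logs;
```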

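ai_translate (machine translation)

Translates text into the specified target language.

Syntax

ai_translate(text, source_lang, target_lang)

Parameters
- text (VARCHAR): The text to translate.
- source_lang (VARCHAR): The source language code, for example 'zh' (Chinese).
- target_lang (VARCHAR): The target language code, for example 'en' (English), 'zh' (Chinese), or 'es' (Spanish). ISO 639-1 language codes are recommended.

Return value

Returns the translated string (VARCHAR).

Example

SELECT ai_translate('Hello, how are you?', 'en', 'es');
-- Returns: "Hola, ¿cómo estás?"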

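ai_similarity (semantic similarity)

Computes the semantic similarity between two texts.

Syntax
```
ai_similarity(text1, text2)
```
Parameters
- text1 (VARCHAR): The first text.
- text2 (VARCHAR): The second text.

Return value

Returns a floating-point number (FLOAT) between 0 and 1 that represents the semantic similarity of the two texts. A higher score means closer meaning, and 1.0 means the texts are identical. The score is intended mainly for ranking.

Example

SELECT
  ai_similarity (
    'I enjoy hiking in the mountains.',
    'I love walking through mountain trails.'
  );
-- Returns: 0.82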
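
Because the score is meant for ranking, a typical use is ordering candidates against one query string. The sketch below assumes a hypothetical `faq` table and an illustrative user question.

```sql
-- Hypothetical table: faq(faq_id BIGINT, question VARCHAR, answer VARCHAR)
SELECT
    faq_id,
    question,
    ai_similarity(question, 'How do I reset my password?') AS score
FROM faq
ORDER BY score DESC   -- closest matches first
LIMIT 5;
```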

ai_summarize (summarization)

Generates a summary of the input text.

Syntax
```
ai_summarize(text)
```
Parameters
- text (VARCHAR): The text to summarize.

Return value

Returns the summary as a string (VARCHAR).

Example

SELECT ai_summarize('Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, pandas API on Spark for pandas workloads, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing.');
-- Returns: "Spark: unified engine for large-scale data processing with APIs and tools."
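
Applied to a table, summarization can be limited to rows that are long enough to need it, which also keeps the number of model calls down. The `support_tickets` table and the 500-character threshold are assumptions for illustration.

```sql
-- Hypothetical table: support_tickets(ticket_id BIGINT, body VARCHAR)
SELECT
    ticket_id,
    ai_summarize(body) AS summary
FROM support_tickets
WHERE char_length(body) > 500;   -- short tickets are skipped
```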

ai_custom_query (custom Resource query)

Runs a query through a user-defined Resource.

Syntax
```
ai_custom_query(resource, prompt)
```
Parameters
- resource (VARCHAR): The name of the user-defined Resource.
- prompt (VARCHAR): The query prompt.

Return value

Returns a string (VARCHAR).

Example

SELECT ai_custom_query('my_knowledge_base', 'What is StarRocks?');
-- Returns: "StarRocks is a high-performance analytical data warehouse..."

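Appendix: cluster configuration parameters

You can modify the following BE dynamic parameters with the ADMIN SET CONFIG command.

Parameter	Details
`ai_function_classify_batch_size`	Parameter type: BE dynamic parameter; Default: 10; Restart required: No; Data type: INT; Description: Number of inputs batched into one AI model request for classification; Note: Batching reduces the number of model requests and shortens response time.
`ai_function_classify_prompt`	Parameter type: BE dynamic parameter; Default: "Classify each of the following texts into one of the following JSON encoded labels: $0. Return only the labels in a JSON array string (not a JSON object) in the same order as the input. Output only the label. Texts: $1"; Restart required: No; Data type: String; Description: Prompt template for classification; Note: $0 is the label list (JSON encoded), and $1 is the list of texts to classify.
`ai_function_custom_query_prompt`	Parameter type: BE dynamic parameter; Default: "$0. For each of the following inputs, provide a response according to the instruction above. Return only the responses in a JSON array string (not a JSON object) in the same order as the input. Output only the response content without any additional text. Inputs: $1"; Restart required: No; Data type: String; Description: Prompt template for custom queries; Note: $0 is the query instruction, and $1 is the list of input texts.
`ai_function_extract_batch_size`	Parameter type: BE dynamic parameter; Default: 1; Restart required: No; Data type: INT; Description: Number of inputs batched into one AI model request for extraction; Note: Batching reduces the number of model requests and shortens response time.
`ai_function_extract_prompt`	Parameter type: BE dynamic parameter; Default: "Extract a value for each of the JSON encoded labels from the texts below. For each label, only extract a single value. Labels: $0. Output the extracted values as a JSON object array in the same order as the input texts. Output only the JSON. Do not output a code block for the JSON. Texts: $1"; Restart required: No; Data type: String; Description: Prompt template for extraction; Note: $0 is the label list (JSON encoded), and $1 is the list of texts to extract from.
`ai_function_fix_grammar_batch_size`	Parameter type: BE dynamic parameter; Default: 3; Restart required: No; Data type: INT; Description: Number of inputs batched into one AI model request for grammar correction; Note: Batching reduces the number of model requests and shortens response time.
`ai_function_fix_grammar_prompt`	Parameter type: BE dynamic parameter; Default: "Fix the grammar in each of the following texts. Return only the corrected texts in a JSON array string (not a JSON object) in the same order as the input. Output only the corrected text. Texts: $0"; Restart required: No; Data type: String; Description: Prompt template for grammar correction; Note: $0 is the list of texts to correct.
`ai_function_gen_batch_size`	Parameter type: BE dynamic parameter; Default: 1; Restart required: No; Data type: INT; Description: Number of inputs batched into one AI model request for text generation; Note: Batching reduces the number of model requests and shortens response time.
`ai_function_gen_prompt`	Parameter type: BE dynamic parameter; Default: "$0. Return only the generated content in a JSON array string (not a JSON object). Output only the generated content."; Restart required: No; Data type: String; Description: Prompt template for text generation; Note: $0 is the generation instruction.
`ai_function_http_connect_timeout_ms`	Parameter type: BE dynamic parameter; Default: 10000; Restart required: No; Data type: INT; Description: HTTP connection timeout for the AI model, in milliseconds; Note: The default is 10 seconds.
`ai_function_http_timeout_ms`	Parameter type: BE dynamic parameter; Default: 600000; Restart required: No; Data type: INT; Description: HTTP request timeout for the AI model, in milliseconds; Note: The default is 10 minutes.
`ai_function_mask_batch_size`	Parameter type: BE dynamic parameter; Default: 3; Restart required: No; Data type: INT; Description: Number of inputs batched into one AI model request for masking; Note: Batching reduces the number of model requests and shortens response time.
`ai_function_mask_prompt`	Parameter type: BE dynamic parameter; Default: "Mask the values for each of the JSON encoded labels in the texts below. Labels: $0. Replace the values with the text [MASKED]. Output only the masked texts in a JSON array string (not a JSON object) in the same order as the input. Do not output anything else. Texts: $1"; Restart required: No; Data type: String; Description: Prompt template for masking; Note: $0 is the label list (JSON encoded), and $1 is the list of texts to mask.
`ai_function_max_inflight_requests`	Parameter type: BE dynamic parameter; Default: 100; Restart required: No; Data type: INT; Description: Maximum number of concurrent HTTP requests to the AI model; Note: Limits the number of AI requests in flight at the same time to avoid overload.
`ai_function_model_api_key`	Parameter type: BE dynamic parameter; Default: "sk-89c1***********95d0"; Restart required: No; Data type: String; Description: The key used to access the Bailian (Alibaba Cloud Model Studio) platform API; Note: For details, see Obtain an API key.
`ai_function_model_endpoint`	Parameter type: BE dynamic parameter; Default: "https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions"; Restart required: No; Data type: String; Description: The model service endpoint, or the endpoint of another Chat/Completions model service compatible with the OpenAI API; Note: For other OpenAI-compatible Chat/Completions services, set the endpoint according to that service's API documentation.
`ai_function_model_type`	Parameter type: BE dynamic parameter; Default: "qwen-plus"; Restart required: No; Data type: String; Description: The specific model to call on the server side; Note: Text generation models are supported.
`ai_function_query_batch_size`	Parameter type: BE dynamic parameter; Default: 1; Restart required: No; Data type: INT; Description: Number of inputs batched into one AI model request for queries; Note: Batching reduces the number of model requests and shortens response time.
`ai_function_query_prompt`	Parameter type: BE dynamic parameter; Default: "For each of the following texts, $0. Return only the answers in a JSON array string (not a JSON object) in the same order as the input. Texts: $1"; Restart required: No; Data type: String; Description: Prompt template for queries; Note: $0 is the query instruction, and $1 is the list of texts to query.
`ai_function_request_template`	Parameter type: BE dynamic parameter; Default: "{"model": "$0", "messages": [{"role": "system", "content": "$1"}, {"role": "user", "content": "$2"}], "response_format": {"type": "json_object"}$3}"; Restart required: No; Data type: String; Description: AI model request template (OpenAI-compatible format); Note: $0 is the model name, $1 is the system prompt, $2 is the user prompt, and $3 is the extra parameters (optional, starting with a comma).
`ai_function_sentiment_batch_size`	Parameter type: BE dynamic parameter; Default: 10; Restart required: No; Data type: INT; Description: Number of inputs batched into one AI model request for sentiment analysis; Note: Batching reduces the number of model requests and shortens response time.
`ai_function_sentiment_prompt`	Parameter type: BE dynamic parameter; Default: "Classify each of the following texts into one of the following labels: [positive, negative, neutral, mixed]. Return only the labels in a JSON array string (not a JSON object) in the same order as the input. Output only the label. Text: $0"; Restart required: No; Data type: String; Description: Prompt template for sentiment analysis; Note: $0 is the list of texts to analyze.
`ai_function_similarity_batch_size`	Parameter type: BE dynamic parameter; Default: 10; Restart required: No; Data type: INT; Description: Number of inputs batched into one AI model request for similarity calculation; Note: Batching reduces the number of model requests and shortens response time.
`ai_function_similarity_prompt`	Parameter type: BE dynamic parameter; Default: "Calculate the similarity between each pair of texts. Return only the similarity scores in a JSON array string (not a JSON object) in the same order as the input. Each score should be a number between 0 and 1, rounded to 2 decimal places. Do not output anything else. Text pairs: $0"; Restart required: No; Data type: String; Description: Prompt template for similarity calculation; Note: $0 is the list of text pairs.
`ai_function_summarize_batch_size`	Parameter type: BE dynamic parameter; Default: 1; Restart required: No; Data type: INT; Description: Number of inputs batched into one AI model request for summarization; Note: Batching reduces the number of model requests and shortens response time.
`ai_function_summarize_prompt`	Parameter type: BE dynamic parameter; Default: "Summarize each of the following texts in $0 words. Return only the summaries in a JSON array string (not a JSON object) in the same order as the input. Output only the summary. Texts: $1"; Restart required: No; Data type: String; Description: Prompt template for summarization; Note: $0 is the target word count, and $1 is the list of texts to summarize.
`ai_function_system_prompt`	Parameter type: BE dynamic parameter; Default: "You are a helpful assistant."; Restart required: No; Data type: String; Description: System prompt for all AI functions; Note: Sets the basic role and behavior of the AI assistant.
`ai_function_translate_batch_size`	Parameter type: BE dynamic parameter; Default: 3; Restart required: No; Data type: INT; Description: Number of inputs batched into one AI model request for translation; Note: Batching reduces the number of model requests and shortens response time.
`ai_function_translate_prompt`	Parameter type: BE dynamic parameter; Default: "Translate each of the following texts into $0. Return only the translated texts in a JSON array string (not a JSON object) in the same order as the input. Output only the translated text. Texts: $1"; Restart required: No; Data type: String; Description: Prompt template for translation; Note: $0 is the target language, and $1 is the list of texts to translate.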

To list the AI Function parameters, run the following query:

select * from information_schema.be_configs where NAME like "ai_%";
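
In open-source StarRocks, BE dynamic parameters can also be changed by updating `information_schema.be_configs` with SQL. Whether an EMR Serverless StarRocks instance permits this path is an assumption, and the value 20 is only illustrative, so treat the following as a sketch rather than the documented procedure.

```sql
-- Assumption: updates through information_schema.be_configs are allowed on this instance
UPDATE information_schema.be_configs
SET value = '20'
WHERE name = 'ai_function_sentiment_batch_size';

-- Check the value that is now in effect
SELECT *
FROM information_schema.be_configs
WHERE name = 'ai_function_sentiment_batch_size';
```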