Automated data analysis-Platform For AI(PAI)-阿里云帮助中心

Prerequisites

You have deployed a RAG service and selected Official Image > pai-rag:0.3.1 as the image. For more information, see Deploy a custom RAG service.

Step 1: Configure the LLM

On the Elastic Algorithm Service (EAS) console, click the name of your target RAG service, and then click View Web Application in the upper-right corner to open the web UI.

On the System Settings page, configure the LLM service and click Save Model Configuration.

Configure one of the following two types of LLM services:

Service type	Parameter	Description
Open-source large language model deployed on Elastic Algorithm Service (EAS)	URL	Configure the access endpoint (EAS_SERVICE_URL) and token (EAS_TOKEN). Important To connect to the LLM service using a private address, the RAG service and the LLM service must be in the same Virtual Private Cloud (VPC). To connect to the LLM service using a public address, you must configure a VPC with public network access for the RAG service. For more information, see Access public or private resources from EAS. To obtain this information, perform the following steps: Deploy an LLM. On the Elastic Algorithm Service (EAS) page, click the name of the deployed LLM service. In the Basic Information section, click View Invocation Information.
	key
	Model name	Enter the model type that you selected when you deployed the LLM service, such as Qwen2.5-7B-Instruct.
Model Studio service Note To use a Model Studio model, you must enable public or private network access for the RAG service. For more information, see Access public or private resources from EAS. Calls to Model Studio models are billed separately. For more information, see Billing.	URL	Set this parameter to `https://dashscope.aliyuncs.com/compatible-mode/v1`.
	key	Obtain your API key from the API Key Management page.
	Model name	Supported models include qwen-max and qwen-turbo. We recommend qwen-max.

Step 2: Analyze data

In the RAG service web UI, switch to the Data Analysis tab.

PAI-RAG supports two data types for analysis: database (a connected MySQL database) and datafile (an uploaded table file).

Configure database analysis

1. Connect to a database

Only MySQL databases are supported. This topic uses an RDS for MySQL as an example, but you can use other MySQL databases.

After you prepare your database, go to the Data Analysis tab in the RAG service web UI, select database as the data type, configure the following parameters, and then click Load Database Information to connect. To update the description, modify it and load the information again.

Basic database configuration

Parameter	Description
Database type	The value is mysql and cannot be changed.
Database port	The default value is 3306.
Database host	Enter the private or public address of the database. For an RDS for MySQL instance, see View and manage instance connection addresses and ports. Note If you use a private address, the RAG service and the RDS instance must be in the same VPC. If you use a public address, you must configure a VPC with public network access for the RAG service. For more information, see Access public or private resources from EAS.
username	The username for the database account.
password	The password for the database account.
Database name	The name of the target database.
Table names	The names of the tables to analyze. Separate multiple table names with a comma (,), for example, `table_A,table_B`. If you leave this parameter empty, the service uses all tables in the database. Important The MySQL database must contain at least one table. Otherwise, the database information cannot be loaded.
Table description	Optional. Add supplementary column descriptions. This parameter is empty by default. Add a general description for a table or detailed descriptions for specific columns. Use the following format: `{"table_A":"Column a represents xxx, and the data in column b is in xxx format","table_B":"This table is mainly used for xxx"}`. You must use English punctuation, such as double quotation marks (""), colons (:), and commas (,). This feature is for temporary debugging and reviewing data analysis results. If you have extensive descriptions, we recommend one of the following methods: Add the descriptions as comments in your database to make them permanent. Organize the descriptions into a CSV file and upload it. For more information, see Configure table file analysis.

Enhancements for large databases

Parameter	Description
Optimize database retrieval with vector embedding	Use case: Recommended when your database tables have many columns, for example, if the total number of columns across all tables is greater than 50. How it works: This feature performs vector retrieval on schema and value information based on your question to quickly filter for potentially useful tables and columns. This prevents the prompt sent to the LLM from exceeding length limits or degrading SQL generation quality. The value-based retrieval fetches unique values from text columns in the database, with limits on the maximum number of columns and unique values per column. A higher limit widens the value range and improves SQL accuracy but increases latency. How to use: Select Yes and configure the following parameters: Maximum Number of Columns: The maximum number of table columns to include in the retrieval. Maximum Number of Unique Values: The maximum number of unique values per column to include in the retrieval. A larger number may increase processing time.
Select tables and columns with LLM	Use case: Suitable for databases with many tables and columns. How it works: The LLM filters for relevant table columns based on your query. How to use: Select Yes.
Use database query history/examples	Use case: Suitable for scenarios where you want to leverage historical or similar queries to improve SQL generation. How it works: This feature uses vector retrieval to find historical or similar queries that closely match the current query. These query-SQL pairs act as references, similar to an FAQ, to help improve the quality of the generated SQL. How to use: Select Yes and upload a JSON file of query-SQL pairs in the Upload Q-SQL JSON File section. The JSON file must use the following format: `[ { "query": "Find the number of pets with a weight greater than 10.", "SQL": "SELECT count() FROM pets WHERE weight > 10" }, { "query": "Find the maximum weight for each pet type. List the maximum weight and the pet type.", "SQL": "SELECT max(weight) , petType FROM pets GROUP BY petType" }, { "query": "Find the number of pets owned by students older than 20.", "SQL": "SELECT count() FROM student AS T1 JOIN has_pet AS T2 ON T1.stuid = T2.stuid WHERE T1.age > 20" } ]`
Upload data table descriptions	How it works: Providing descriptions for all columns in each table helps the LLM better understand the meaning of the columns, which improves SQL generation. How to use: Select Yes and upload a CSV description file in the Upload Database Description File section. The CSV filename must match the target table name. Upload description files for multiple tables. For the required format, see Data format in the CSV file.

Data format in the CSV file

original_column_name	column_name	column_description	data_format	value_description
cds		California Department Schools	text	useless
rtype		rtype	text	useless
sname	school name	school name	text
dname	district name	district name	text
cname	county name	county name	text
enroll12	enrollment (1st-12th grade)	enrollment (1st-12th grade)	integer
NumTstTakr	Number of Test Takers	Number of Test Takers in this school	integer	number of test takers in each school
AvgScrRead	average score	average score	integer	average score for reading
NumGE1500	Number of Test Takers Whose Total SAT Scores Are Greater or Equal to 1500	Number of Test Takers Whose Total SAT Scores Are Greater or Equal to 1500	integer	Number of Test Takers Whose Total SAT Scores Are Greater or Equal to 1500 \n\n common sense evidence: \n Excellence Rate = NumGE1500 / NumTstTakr"

Description of columns:

original_column_name: The column's original name in the database.
column_name: An alias for the column. This parameter is optional.
column_description: A description of the column. Use Chinese or English.
data_format: The data type of the column, such as string, integer, or float. This parameter is optional.
value_description: A description of the column's values. Use Chinese or English. If this is similar to the column description, you only need to fill in one of the two parameters.

2. Configure prompts

In the lower-left corner of the Data Analysis page, customize the prompts used to generate SQL and synthesize the final response.

If you have custom requirements, you can modify the content between the dotted lines in the Prompt Template. The content within {} (half-width curly braces) outside the dotted lines represents input parameters and must be retained. At the bottom left of the page, there are three tabs: Nl2sql Prompt Template, Synthesizer Prompt Template, and Reset Prompt Template. You can configure the corresponding prompt content on each tab.
To restore a default template, go to the Reset Prompt Template tab and click either Reset Nl2sql Prompt Template or Reset Synthesizer Prompt Template.

3. Query results

In the chatbot panel on the right, enter a question. The system displays the generated SQL statement and its syntax validity. Note that valid syntax does not guarantee correct business logic. Use these references as a debugging tool to optimize query results.

Conversational memory is enabled by default. For example, if you ask "How many cats are in the table?", the system automatically queries the relevant data table, such as pets, generates an SQL statement, executes the query, and returns a result such as "There is 1 cat in the table." The references section displays the queried table name, the generated SQL statement, and its syntax validation result to help you debug and optimize queries.

4. Optimize query results

If the query results do not meet your expectations, try the following optimization methods:

Issue	Optimization method
The target database has many tables, or the references show that the query used the wrong table.	Limit the table scope: Explicitly specify the table names to query. Add column descriptions: Supplement the column descriptions in the Table description field to help the LLM better understand the relationships and functions of different tables. For more information, see Basic database configuration.
The column names in the database are abstract, such as single letters or domain-specific jargon.	Add column descriptions to help the LLM understand domain-specific knowledge: Add comments directly in the database. Add them in dictionary format in the Table description field. For more information, see Basic database configuration. Upload a CSV file with table descriptions by using the enhancements for large databases. For more information, see Enhancements for large databases.
The references show that the generated SQL statement does not meet certain business logic requirements or the logic itself is complex.	Configure prompts: If the business logic is relatively common, add hints or examples about the logic to the prompt template. For more information, see 2. Configure prompts. Use query history/examples: Upload sample query-SQL pairs to improve generation quality. For more information, see Enhancements for large databases.
The target database has a large number of table columns, such as more than 50 total columns.	Add column descriptions: Further optimize the column descriptions for each table. For more information, see Basic database configuration. Use retrieval optimization: Enable Optimize database retrieval with vector embedding and Select tables and columns with LLM. These enhancements reduce interference in the prompt template. For more information, see Enhancements for large databases.
The references show that the generated SQL statement contains non-SQL content, such as an extra `sql` prefix. This can be an instruction-following issue with some smaller models.	Configure prompts: Add a simple constraint in the Nl2sql Prompt Template.
The SQL is correct, but you need a more personalized answer, for example, a specific tone of voice or the inclusion of units.	Configure prompts: Adjust the prompt in the Synthesizer Prompt Template.

Configure table file analysis

On the Data Analysis tab in the RAG service web UI, select datafile as the data type. In the Upload a csv/xlsx file for data analysis section, follow the on-screen instructions to upload a table file in Excel or CSV format. After the file is uploaded, a preview of its first few rows appears in the lower-left corner.

Ask questions directly in the chatbot on the right. To analyze a different file, simply upload the new file.

FAQ

Database connection failed

Network and connection issues

Incorrect credentials: Verify that the hostname, port, username, and password are correct.
Firewall and security group settings: Confirm that your firewall and security group rules allow traffic on the specified port.
VPC and private network settings: If the RAG service and the database are in the same Virtual Private Cloud (VPC), use the private address. If public access is required, ensure the RAG service's VPC has public network access.

Permission and authentication issues

User permissions: Confirm that the user has sufficient permissions to access the target database. If necessary, use the GRANT command to grant the required permissions.
Allowlist settings: Ensure the client's IP address is on the database's access allowlist.

Database service status

MySQL service status: Check the MySQL service status to ensure it is running.
Listening port: Confirm that the MySQL service is listening on the correct port.