Automated data analysis

更新时间:
复制 MD 格式

PAI-RAG provides an automated data analysis feature that supports two data types: database (a connected MySQL database) and datafile (an uploaded table file). By leveraging a large language model (LLM), you can ask questions and optimize query results. This topic explains how to use the automated data analysis feature in PAI-RAG.

Prerequisites

You have deployed a RAG service and selected Official Image > pai-rag:0.3.1 as the image. For more information, see Deploy a custom RAG service.

Step 1: Configure the LLM

  1. On the Elastic Algorithm Service (EAS) console, click the name of your target RAG service, and then click View Web Application in the upper-right corner to open the web UI.

  2. On the System Settings page, configure the LLM service and click Save Model Configuration.

    Configure one of the following two types of LLM services:

    Service type

    Parameter

    Description

    Open-source large language model deployed on Elastic Algorithm Service (EAS)

    URL

    Configure the access endpoint (EAS_SERVICE_URL) and token (EAS_TOKEN).

    Important
    • To connect to the LLM service using a private address, the RAG service and the LLM service must be in the same Virtual Private Cloud (VPC).

    • To connect to the LLM service using a public address, you must configure a VPC with public network access for the RAG service. For more information, see Access public or private resources from EAS.

    To obtain this information, perform the following steps:

    1. Deploy an LLM.

    2. On the Elastic Algorithm Service (EAS) page, click the name of the deployed LLM service. In the Basic Information section, click View Invocation Information.

    key

    Model name

    Enter the model type that you selected when you deployed the LLM service, such as Qwen2.5-7B-Instruct.

    Model Studio service

    Note
    • To use a Model Studio model, you must enable public or private network access for the RAG service. For more information, see Access public or private resources from EAS.

    • Calls to Model Studio models are billed separately. For more information, see Billing.

    URL

    Set this parameter to https://dashscope.aliyuncs.com/compatible-mode/v1.

    key

    Obtain your API key from the API Key Management page.

    Model name

    Supported models include qwen-max and qwen-turbo. We recommend qwen-max.

Step 2: Analyze data

In the RAG service web UI, switch to the Data Analysis tab.

PAI-RAG supports two data types for analysis: database (a connected MySQL database) and datafile (an uploaded table file).

Configure database analysis

1. Connect to a database

Only MySQL databases are supported. This topic uses an RDS for MySQL as an example, but you can use other MySQL databases.

After you prepare your database, go to the Data Analysis tab in the RAG service web UI, select database as the data type, configure the following parameters, and then click Load Database Information to connect. To update the description, modify it and load the information again.

  • Basic database configuration

    Parameter

    Description

    Database type

    The value is mysql and cannot be changed.

    Database port

    The default value is 3306.

    Database host

    Enter the private or public address of the database. For an RDS for MySQL instance, see View and manage instance connection addresses and ports.

    Note
    • If you use a private address, the RAG service and the RDS instance must be in the same VPC.

    • If you use a public address, you must configure a VPC with public network access for the RAG service. For more information, see Access public or private resources from EAS.

    username

    The username for the database account.

    password

    The password for the database account.

    Database name

    The name of the target database.

    Table names

    The names of the tables to analyze. Separate multiple table names with a comma (,), for example, table_A,table_B. If you leave this parameter empty, the service uses all tables in the database.

    Important

    The MySQL database must contain at least one table. Otherwise, the database information cannot be loaded.

    Table description

    Optional. Add supplementary column descriptions. This parameter is empty by default. Add a general description for a table or detailed descriptions for specific columns. Use the following format: {"table_A":"Column a represents xxx, and the data in column b is in xxx format","table_B":"This table is mainly used for xxx"}. You must use English punctuation, such as double quotation marks (""), colons (:), and commas (,).

    This feature is for temporary debugging and reviewing data analysis results. If you have extensive descriptions, we recommend one of the following methods:

    • Add the descriptions as comments in your database to make them permanent.

    • Organize the descriptions into a CSV file and upload it. For more information, see Configure table file analysis.

  • Enhancements for large databases

    Parameter

    Description

    Optimize database retrieval with vector embedding

    • Use case: Recommended when your database tables have many columns, for example, if the total number of columns across all tables is greater than 50.

    • How it works: This feature performs vector retrieval on schema and value information based on your question to quickly filter for potentially useful tables and columns. This prevents the prompt sent to the LLM from exceeding length limits or degrading SQL generation quality. The value-based retrieval fetches unique values from text columns in the database, with limits on the maximum number of columns and unique values per column. A higher limit widens the value range and improves SQL accuracy but increases latency.

    • How to use: Select Yes and configure the following parameters:

      • Maximum Number of Columns: The maximum number of table columns to include in the retrieval.

      • Maximum Number of Unique Values: The maximum number of unique values per column to include in the retrieval. A larger number may increase processing time.

    Select tables and columns with LLM

    • Use case: Suitable for databases with many tables and columns.

    • How it works: The LLM filters for relevant table columns based on your query.

    • How to use: Select Yes.

    Use database query history/examples

    • Use case: Suitable for scenarios where you want to leverage historical or similar queries to improve SQL generation.

    • How it works: This feature uses vector retrieval to find historical or similar queries that closely match the current query. These query-SQL pairs act as references, similar to an FAQ, to help improve the quality of the generated SQL.

    • How to use: Select Yes and upload a JSON file of query-SQL pairs in the Upload Q-SQL JSON File section. The JSON file must use the following format:

      [
        {
          "query": "Find the number of pets with a weight greater than 10.",
          "SQL": "SELECT count(*) FROM pets WHERE weight  >  10"
        },
        {
          "query": "Find the maximum weight for each pet type. List the maximum weight and the pet type.",
          "SQL": "SELECT max(weight) ,  petType FROM pets GROUP BY petType"
        },
        {
          "query": "Find the number of pets owned by students older than 20.",
          "SQL": "SELECT count(*) FROM student AS T1 JOIN has_pet AS T2 ON T1.stuid  =  T2.stuid WHERE T1.age  >  20"
        }
      ]

    Upload data table descriptions

    • How it works: Providing descriptions for all columns in each table helps the LLM better understand the meaning of the columns, which improves SQL generation.

    • How to use: Select Yes and upload a CSV description file in the Upload Database Description File section. The CSV filename must match the target table name. Upload description files for multiple tables. For the required format, see Data format in the CSV file.

    Data format in the CSV file

    original_column_name

    column_name

    column_description

    data_format

    value_description

    cds

    California Department Schools

    text

    useless

    rtype

    rtype

    text

    useless

    sname

    school name

    school name

    text

    dname

    district name

    district name

    text

    cname

    county name

    county name

    text

    enroll12

    enrollment (1st-12th grade)

    enrollment (1st-12th grade)

    integer

    NumTstTakr

    Number of Test Takers

    Number of Test Takers in this school

    integer

    number of test takers in each school

    AvgScrRead

    average score

    average score

    integer

    average score for reading

    NumGE1500

    Number of Test Takers Whose Total SAT Scores Are Greater or Equal to 1500

    Number of Test Takers Whose Total SAT Scores Are Greater or Equal to 1500

    integer

    Number of Test Takers Whose Total SAT Scores Are Greater or Equal to 1500 \n\n common sense evidence: \n Excellence Rate = NumGE1500 / NumTstTakr"

    Description of columns:

    • original_column_name: The column's original name in the database.

    • column_name: An alias for the column. This parameter is optional.

    • column_description: A description of the column. Use Chinese or English.

    • data_format: The data type of the column, such as string, integer, or float. This parameter is optional.

    • value_description: A description of the column's values. Use Chinese or English. If this is similar to the column description, you only need to fill in one of the two parameters.

2. Configure prompts

In the lower-left corner of the Data Analysis page, customize the prompts used to generate SQL and synthesize the final response.

  • If you have custom requirements, you can modify the content between the dotted lines in the Prompt Template. The content within {} (half-width curly braces) outside the dotted lines represents input parameters and must be retained. At the bottom left of the page, there are three tabs: Nl2sql Prompt Template, Synthesizer Prompt Template, and Reset Prompt Template. You can configure the corresponding prompt content on each tab.

  • To restore a default template, go to the Reset Prompt Template tab and click either Reset Nl2sql Prompt Template or Reset Synthesizer Prompt Template.

3. Query results

In the chatbot panel on the right, enter a question. The system displays the generated SQL statement and its syntax validity. Note that valid syntax does not guarantee correct business logic. Use these references as a debugging tool to optimize query results.

Conversational memory is enabled by default. For example, if you ask "How many cats are in the table?", the system automatically queries the relevant data table, such as pets, generates an SQL statement, executes the query, and returns a result such as "There is 1 cat in the table." The references section displays the queried table name, the generated SQL statement, and its syntax validation result to help you debug and optimize queries.

4. Optimize query results

If the query results do not meet your expectations, try the following optimization methods:

Issue

Optimization method

The target database has many tables, or the references show that the query used the wrong table.

  • Limit the table scope: Explicitly specify the table names to query.

  • Add column descriptions: Supplement the column descriptions in the Table description field to help the LLM better understand the relationships and functions of different tables. For more information, see Basic database configuration.

The column names in the database are abstract, such as single letters or domain-specific jargon.

Add column descriptions to help the LLM understand domain-specific knowledge:

  • Add comments directly in the database.

  • Add them in dictionary format in the Table description field. For more information, see Basic database configuration.

  • Upload a CSV file with table descriptions by using the enhancements for large databases. For more information, see Enhancements for large databases.

The references show that the generated SQL statement does not meet certain business logic requirements or the logic itself is complex.

  • Configure prompts: If the business logic is relatively common, add hints or examples about the logic to the prompt template. For more information, see 2. Configure prompts.

  • Use query history/examples: Upload sample query-SQL pairs to improve generation quality. For more information, see Enhancements for large databases.

The target database has a large number of table columns, such as more than 50 total columns.

  • Add column descriptions: Further optimize the column descriptions for each table. For more information, see Basic database configuration.

  • Use retrieval optimization: Enable Optimize database retrieval with vector embedding and Select tables and columns with LLM. These enhancements reduce interference in the prompt template. For more information, see Enhancements for large databases.

The references show that the generated SQL statement contains non-SQL content, such as an extra sql prefix. This can be an instruction-following issue with some smaller models.

Configure prompts: Add a simple constraint in the Nl2sql Prompt Template.

The SQL is correct, but you need a more personalized answer, for example, a specific tone of voice or the inclusion of units.

Configure prompts: Adjust the prompt in the Synthesizer Prompt Template.

Configure table file analysis

On the Data Analysis tab in the RAG service web UI, select datafile as the data type. In the Upload a csv/xlsx file for data analysis section, follow the on-screen instructions to upload a table file in Excel or CSV format. After the file is uploaded, a preview of its first few rows appears in the lower-left corner.

Ask questions directly in the chatbot on the right. To analyze a different file, simply upload the new file.

FAQ

Database connection failed

Network and connection issues

  • Incorrect credentials: Verify that the hostname, port, username, and password are correct.

  • Firewall and security group settings: Confirm that your firewall and security group rules allow traffic on the specified port.

  • VPC and private network settings: If the RAG service and the database are in the same Virtual Private Cloud (VPC), use the private address. If public access is required, ensure the RAG service's VPC has public network access.

Permission and authentication issues

  • User permissions: Confirm that the user has sufficient permissions to access the target database. If necessary, use the GRANT command to grant the required permissions.

  • Allowlist settings: Ensure the client's IP address is on the database's access allowlist.

Database service status

  • MySQL service status: Check the MySQL service status to ensure it is running.

  • Listening port: Confirm that the MySQL service is listening on the correct port.